New biodiversity 
targets cannot 
afford to fail 


Global goals to protect natural systems will be 
revised this year. China must help to ensure the 
new targets are measurable and meaningful. 


Most measures of biodiversity suggest 
that things are going badly wrong. Some 
one million plant and animal species face 
extinction, according to the Intergovern- 
mental Science-Policy Platform on Bio- 
diversity and Ecosystem Services (IPBES). And French 
President Emmanuel Macron last week called the battle for 
biodiversity and climate change “the fight of the century”. 

A decade ago, countries united to create a 10-year plan, 
sub-divided into 20 targets, for protecting and conserving 
natural systems. The plan, also known as the Aichi Biodi- 
versity Targets, expires at the end of this year — and most 
of the targets will not have been reached. Between 24 and 
29 February, representatives of the international community 
will meet in Rome to discuss a new plan. A lot is at stake, 
and it’s vital that the world unites behind the effort. 

The meeting will consider a draft of an updated set of 
global goals, which must be agreed by the summer. Then, 
in October, world leaders will gather in Kunming, China, 
for the Conference of the Parties to the United Nations 
Convention on Biological Diversity. China will be in the 
chair, the first time it will lead on a conference of the parties 
to one of the ‘big two’ global environmental conventions 
(see News page 345). 

These discussions are as important to biodiversity as 
the Paris agreement was to climate, and are likely to be 
similarly fraught. Conservation groups back more strin- 
gent and more measurable targets. European countries sit 
somewhere between the United States — which has long 
refused to sign the biodiversity convention — and devel- 
oping countries, which will be looking to China to fight 
their corner. But China’s efforts to build consensus have 
been set back by the coronavirus, which has seen parts of 
the country closed down. 

To be fair, not every biodiversity policy has failed. Among 
the hard-won achievements is the 2014 Nagoya Protocol, 
an agreement stating that the benefits of genetic resources 
must be shared equitably among all of those — including 
Indigenous communities — who have contributed to their 
development. This can take time, and the World Health 
Organization has been discussing how to reduce potential 
delays when genetic information needs to be shared in 
public-health emergencies. But the protocol’s existence is a 
win for multilateral science and environmental diplomacy. 

By contrast, there’s been no clear progress on the 
headline ambition to slow and eventually reverse the loss 
of biological diversity around the world. 

The Aichi targets failed, in part, because their format 
makes progress hard to measure. Ahead of this year’s talks, 
a group of researchers led by Elizabeth Green at the Cen- 
tre for Conservation Science in Sandy, UK, scanned the 
literature for mentions of the Aichi targets since 2010. The 
team then invited an expert group to score the targets on 
a scale of one to ten. All of the targets scored highly for 
being comprehensive, but most scored relatively poorly 
on being measurable and realistic (E. J. Green et al. Conserv. 
Biol. 33, 1360-1369; 2019). 

Take the first target, intended to ensure that “people are 
aware of the values of biodiversity and the steps they can 
take to conserve and use it sustainably”. It’s clear this aims 
to raise public awareness of and engagement with biodi- 
versity, but it’s not clear when success has been achieved. 

Those drawing up a new generation of biodiversity goals 
and targets understand this. The text of a new draft released 
last month contains spaces in square brackets, ready to be 
filled in when more-quantitative measures are agreed. Such 
measures include ensuring strict protection for impor- 
tant ecosystems and finding nature-based solutions that 
increase resilience to natural disasters (see Comment 
page 360). 


Ambition versus achievement 


The Aichi targets didn’t fail solely because they weren't 
measurable. They also failed because countries did not 
need to report what they were doing to achieve them. 

The biodiversity convention’s member states have to 
publish biodiversity action plans — but these are often 
statements of a country’s ambitions, rather than records of 
its achievements. For the next set of goals this has to change, 
and fortunately there seems to be a way forward. This is the 
UN System of Environmental Economic Accounting (SEEA), 
a mechanism for reporting environmental data, and it needs 
to become the global standard for environmental reporting. 

SEEA was adopted in 2012 to encourage countries’ 
national statistical offices to take responsibility for col- 
lecting and reporting environmental data. Asking statistics 
offices to do this was a stroke of genius. These offices are 
already responsible for reporting national economic data 
to the UN. They work to the best available standards and 
strict deadlines — and they get the job done. Charging them 
with reporting environmental data ensured that these data 
would be treated in the same way. 

What began as a trickle of countries following the system 
has surged to more than 80 states sending updates to the 
UN on a multitude of environmental indicators, from the 
state of their forests to the state of their fisheries. Develop- 
ing countries will need to be supported to get up to speed 
and contribute their own ideas. But the die is cast. 

As is sometimes the case with the UN, a lack of joined-up 
thinking allowed SEEA to emerge independently of other 
indicators, such as the Aichi targets and the Sustainable 
Development Goals (SDGs). Now, moves are under way 
towards some harmonization. Last July, the UN published a 
global indicator review (go.nature.com/2ssazbc) in which 
researchers confirmed that countries could use SEEA to 
report 34 of the 147 Aichi target indicators and 21 of the 
230 SDG target indicators. This is an important start, but 
also indicates how much needs to be done before more 
goals and targets can be reported using the SEEA frame- 
work — an opportunity which researchers must not pass up. 

Measuring and reporting numerical targets, although 
vital, is not the whole story. If the world is to understand 
why the Aichi targets failed — and improve on them — it 
must assess the broader obstacles. 

One is the historical tension between development and 
the environment — and the expectation of poorer countries 
that they should be able to develop, just as richer countries 
did. There is also a perception that new environmental 
standards will hold them back. No one can contest their 
case for developing, but, considering the state of the 
planet, their concerns need to be met through greener 
development. They need support to provide their citizens 
with basic amenities — such as clean water, nutrition and 
power — in a way that is sustainable and protects future 
generations. This means making significant changes to 
how economic decisions are made. 


No contest 


Usually, in any contest between industrial growth and the 
preservation of species and ecosystems, growth comes 
out on top. Biodiversity is rarely allowed to stop or delay a 
new airport runway or power plant. If a wetland needs to be 
concreted over to make way for a housing development, in 
many countries it has little chance of being protected, even 
though losing the wetland means sacrificing the services 
it provides to people — such as wildlife habitats and flood 
defences. These services are rarely quantified. 

Fortunately, researchers and policymakers globally are 
taking a stronger interest in valuing biodiversity’s contribution 
to economies and to societies. IPBES is deep in a 
project that will advise countries on the many ways to value 
biodiversity; a report is due to be presented next year. And 
last year, the UK Treasury launched its own independent 
review, chaired by the economist Partha Dasgupta of the 
University of Cambridge, that is due to report in time for 
the biodiversity conference in China. 

We know that working in an economic and financial sys- 
tem that places little value on the natural world will make 
it difficult to meet goals in biodiversity and sustainable 
development. That’s why it is prudent to tackle smaller 
aspects of the system — at least for now. At the same time, 
it’s imperative that the new biodiversity goals find syner- 
gies and avoid conflicts with the Paris climate agreement 
and the SDGs, neither of which existed a decade ago. 

The road to the Kunming convention will be long and 
complicated. This is inevitable, both because life on Earth 
is itself beautifully complex, with so many global systems 
influencing biodiversity, and because the outcomes mat- 
ter. Humanity’s future depends on our ability to protect the 
planet. Greater awareness of threats to the natural world 
— perhaps an intangible impact of the Aichi targets — has 
created a moment ripe for action. The challenge will be to 
keep the devil in the detail from derailing the process itself. 
The final countdown 


The United Kingdom’s Research Excellence 
Framework might turn out to be the last. 


It was the day most UK academics were dreading. On 
Monday 17 February, funding agencies fired the start- 
ing gun on the next Research Excellence Framework 
(REF 2021), the United Kingdom’s system for evaluat- 
ing research quality. 

Universities have until 27 November to submit their 
researchers’ outputs to the REF. These will then be graded 
by review panels on a scale of 1 to 4 — the highest score 
meaning that the work is deemed “world leading” in its 
originality, significance and rigour. 

A lot is riding on the outcome because funders use the 
results to allocate around £2 billion (US$2.6 billion) in 
annual research funding to university departments. Most 
institutions will want to see their academics graded in the 
top two bands, because lower-performing departments 
are unlikely to get much money at all. 

The exercise is valuable in providing public accounta- 
bility for research spending while protecting universities’ 
financial autonomy. But many researchers and research 
managers are wondering whether REF 2021 could be the last. 

Many would not mourn the REF’s demise. By coincidence, 
from 20 February thousands of UK academics will be on 
strike for 14 days, calling for better pay and more-secure 
pensions. The constant monitoring of performance that 
comes with research evaluation is also mentioned by aca- 
demics as a source of stress and anxiety. 

The REF is also not cheap to administer — the 2014 exercise 
cost around £246 million. And as with most indices, the REF’s 
overlords keep having to make changes to prevent it from 
being gamed. In the past, departments were able to achieve 
high scores by submitting outputs from a fraction of their 
best-performing staff — something that is no longer allowed. 

Universities that obtain the most REF-based funding are 
concentrated in London and southeast England, and this has 
fuelled arguments that the metric’s funding formula helps to 
reinforce the UK’s regional imbalance. That alone could be 
an argument for radical reform from a government looking 
to level up funding to other parts of the United Kingdom. 

That said, the REF’s critics need to be careful what they 
wish for, because the framework protects money that universities 
rely on to pay salaries and to keep the lights on. The 
government of Prime Minister Boris Johnson is also looking 
to cut funding from publicly funded bodies that have oper- 
ated largely autonomously from the state — including the 
national broadcaster, the BBC. Moreover, proposals for 
research funding reforms are widely expected this year. 

A bonfire of the REF might well appeal to many, but 
not if the outcome leads to cuts, or reduced autonomy 
for institutions. There could be a wiser option: adjust the 
REF’s funding formula so that money for the best work is 
distributed more fairly across the United Kingdom. 
MAGGIE RYAN SANDFORD 


A personal take on science and society 


World view 


By Maggie 
Ryan Sandford 


You can't fight feelings with 
facts: start with a chat 


I donned a sandwich board inviting questions 
on evolution and learnt three crucial lessons 
about public engagement on divisive issues, 
writes Maggie Ryan Sandford. 


I went to the Minnesota State Fair last year wearing 
a sandwich board. It said, “Ask me anything about 
evolution.” Proponents of evolution assumed I was a 
religious zealot. Creationists assumed I was there to 
mock their beliefs. The biggest challenge in fighting 
misinformation? Just getting a conversation started. 

This public-engagement stunt taught me a crucial 
lesson: the key to effective science communication isn’t 
the science. It’s communication. 

Attendees had come to show off prize livestock, eat corn 
dogs and ride the Ferris wheel, not get angry about some- 
one who disagrees with them about the origin of life on 
Earth. Most folks wouldn't stop to talk unless I passed what 
I came to recognize as ‘the first test’. Some would call out, 
without slowing: “Do you believe in evolution?” Others, 
“Do you believe in God?” 

Part of me died each time I answered with a profoundly 
un-nuanced “Yes!” But, as a science communicator and 
former education researcher, I knew that, in matters of 
deep personal belief, facts matter less than feelings. The 
need to identify whom you're dealing with is a natural 
human instinct. Answering was the only way to unlock 
the rest of the conversation. So I simply let people know I 
was a big fan of the globe and everything on it, and that I’d 
written a book about animals that I hoped people would 
find inviting. 

In the cow barn, a man with his pre-teen daughter smiled 
at me, then avoided my gaze. “We don’t believe in evolution, 
so...” “Okay!” I called back. “Did you want to talk about 
cows?” I pointed out how humans can build muscle by eat- 
ing cow protein, because of our shared ancestry. We know 
that ‘relevancy’ is crucial to public understanding — people 
need takeaways that relate to their everyday lives. “Well, 
that’s awesome,” he said, “because I do love a good steak!” 
Before he and his daughter walked away, we exchanged 
thumbs ups. 


Lesson 1: Don’t argue with beliefs. People tend to 
incorporate facts that align with their belief systems. 

No problem. I just had to find topics that made sense to 
all of us — pro- and anti-evolution alike. Dogs or livestock 
breeding, for example. Half the folks within a 30-metre 
radius were there to showcase their carefully bred cows, 
horses and chickens. Open-faced and genuine, I invited 
them to school me on the areas of their expertise. Which, 
it turns out, is evolution. 
Lesson 2: Listen. The most challenging group of the day 
consisted of two men and a woman in their late twenties. 
The men were just looking for a fight. Telling me why I was 
wrong was, I supposed, a way of asking me about evolution. 
I asked them to elaborate, to tell me why it was that they 
found evolution hard to swallow. This led to their female 
companion insisting: “She listened to you. Now you listen 
to her.” In the end, one man explained my points to the 
other. “She’s saying evolution is mutations in our DNA,” 
he said, forcing his companion to let him finish. “I’m just 
saying, I get her side.” 


Lesson 3: Learn what people really think. Almost everyone 
— secular and religious — had misconceptions about 
evolution. Advocates of evolution often hadn't learnt that 
evolution can now be tracked in genomes, not just fossils, 
and that humans are related to all living things, and that we 
didn’t come from apes because we are apes (keep in mind, 
‘ape’ is a word that humans made up). 

But the misconceptions of religiously inclined folks 
often had greater personal significance. Listening to them, 
it became clear that they considered evolution an attack on 
all they held dear. Several asked me about a narrative they'd 
heard somewhere about how “life began when water was 
dripping on a rock”. Clearly, they were worried that such 
a narrative undercut the idea that humans were created 
in the image of God. 

People from both groups often misinterpreted the term 
‘survival of the fittest’, and were surprised to hear that 
evolution isn’t a system of improvement, just a system of 
change. And that On the Origin of Species was not intended 
as an attack on faith. Even in old age, Darwin declared: “I 
have never been an atheist.” 


Lay people are more likely to trust and engage with science 
when they learn that researchers are human beings, fallible 
and conflicted. Yet somehow it seems hard for many in the 
scientific community to show those qualities to others. A 
common concern is that, in the anti-evolution, anti-science 
debate, any whiff of disagreement or uncertainty spells 
doom for scientific arguments. 

When I began this ‘experiment’, my hypothesis was 
that a willingness to show vulnerability — to show that 
we science folks are willing to listen and receive criticism 
— boosts credibility, not the opposite. I think my experi- 
ence supports that. When feelings speak louder than facts, 
appealing to feelings can actually work in favour of science. 

No matter where we think we fall in the evolution debate, 
all of us are human, and we evolved to read each other’s 
facial expressions and tones of voice, to be together. 
Returning to our humble, apish roots is the only way to 
see anti-science sentiment go the way of the dodo. 
The world this week 


News in brief 


ANIMAL-RESEARCH DATA SHOW 
EFFECTS OF EU'S TOUGH REGULATIONS 


Scientists in the European 
Union seem to be using fewer 
animals for research, according 
to statistics gathered by the 
European Commission. The 
figures come from the first 
report on the state of animal 
research in the bloc since 
the introduction of tougher 
regulations seven years ago. 
The report — published on 
6 February — reviews the impact 
of an animal-research directive, 
legislation that was designed 
to lessen the use of animals in 
research and minimize their 
suffering. The directive is 
widely considered to be one of 
the world’s toughest on animal 
research. 
According to the report, 
9.39 million animals were 
used for scientific purposes in 
2017 — the most recent year for 
which data have been collated — 
compared with 9.59 million 
in 2015. From 2015 to 2016, 
however, there was a slight 
increase, to 9.82 million. The 
report acknowledges that this 
prevents the confirmation of a 
clear decrease. But it adds that, 
when compared with figures 
from before the directive came 
into force, the numbers suggest 
“a clear positive development”. 

In 2017, more than two-thirds 
of instances of animal use were 
in basic or applied research 
(45% and 23%, respectively), 
and around one-quarter (23%) 
involved the testing of drugs 
and other chemicals to meet 
regulatory requirements. 
Other uses included the routine 
production of biological agents 
such as vaccines; teaching; and 
forensic investigations (see 
‘Animals in science’). 

The legislation sets out high 
standards for the housing and 
care of animals, and promotes 
methods that cause the least 
pain and use a minimal number 
of animals. It requires member 
states to submit detailed 
data, including the number 
and species of animals used 
in research, as well as the 
number of times each animal 
is used, and the purpose and 
severity of experimental 
procedures. 

A spokesperson for the 
European Commission says that 
such detailed data “allow us to 
identify far more effectively 
where best to target resources 
to help reduce the number and 
suffering of animals”. 


ANIMALS IN SCIENCE 

In 2017, more than two-thirds of recorded instances of animal 
use in the European Union were in basic or applied research. 
Chart segments recovered: translational and applied research, 23%; 
regulatory use, 23%; routine manufacture of medical products, 5%; 
other, 4%. 2017 is the latest year for which data are available. 


CORONAVIRUS 
NAME PROMPTS 
CONTROVERSY 


The disease caused by the new 
coronavirus now has an official 
name: COVID-19. 

The disease, and the virus, 
had been going by a number of 
monikers, including 2019-nCoV, 
since they emerged in China in 
December. The World Health 
Organization (WHO), which 
announced the new name on 
11 February, said that it chose 
one that did not refer to a 
geographical location, an animal 
or a group of people, to avoid 
stigma. 

On the same day, a group with 
the International Committee 
on Taxonomy of Viruses, which 
is responsible for naming 
the pathogens, designated 
the virus itself SARS-CoV-2 
(A. E. Gorbalenya et al. Preprint 
on bioRxiv http://doi.org/dmsh; 
2020). The group said that 
this term highlights the new 
virus’s similarity to SARS-CoV, 
the coronavirus identified in 
2003 that causes severe acute 
respiratory syndrome. 

But the virus name caused 
consternation, particularly 
among Chinese virologists, who 
worry that it will confuse the 
public and impede efforts to 
control the pathogen’s spread. 
Although the two viruses belong 
to the same species, the new 
coronavirus spreads faster than 
SARS-CoV but is less deadly, says 
Shibo Jiang, a virologist at Fudan 
University in Shanghai. The new 
coronavirus has infected more 
than 73,000 people. 
INFLUENTIAL CLIMATE 
CHIEF DIES 


Rajendra Pachauri, an Indian 
environmentalist and former 
head of the Intergovernmental 
Panel on Climate Change 
(IPCC) who had been accused 
of sexual harassment, died on 
13 February. He was 79 and his 
death followed recent heart 
surgery, according to media 
reports. 

From 2002 to 2015, Pachauri 
was chair of the IPCC — the 
international organization that 
produces scientific reports on 
the state of climate change, and 
whose assessments informed the Paris agreement 
to limit global warming. In 
2007, during his tenure, the 
organization received the Nobel 
Peace Prize. 

Born in 1940 in Uttarakhand 
state, Pachauri studied 
engineering and economics in 
India and the United States. He 
became director of the Energy 
and Resources Institute (TERI), 
a climate and energy-policy 
institute based in New Delhi, 
in 1981. He received several 
civilian honours from the Indian 
government. 

But in 2015, he stepped down 
as chief of the IPCC and from 
TERI’s leadership after a female 
colleague accused him of sexual 
harassment. Pachauri denied the 
accusations; a case was pending 
in a Delhi court at the time of his 
death. 


UK SCIENCE 
MINISTER OUSTED 


A government reshuffle last 
week ousted the UK minister 
for universities and science, 
Chris Skidmore. 

Skidmore had occupied the 
position — which has seen a 
revolving door of appointees 
and a series of resignations in 
recent years — for two periods 
since 2018, and was popular with 
academics. 

As Nature went to press, it 
was not clear who would take 
on the ministerial briefs for 
universities and science. But 


News in focus 


The giant panda (Ailuropoda melanoleuca) was upgraded from endangered to vulnerable in 2016. 


CHINA TAKES CENTRE 
STAGE IN MAJOR 
BIODIVERSITY PUSH 


A United Nations summit could see China press for ambitious 
targets, and spotlights the country’s own conservation efforts. 


By Smriti Mallapaty 


The world’s species and natural 
ecosystems are in crisis. When nearly 
200 countries gather next week to 
thrash out a major plan to stem the precipitous 
decline, China is expected to 
take a prominent role. The high-stakes negotiations 
will set the stage for a major biodiversity 
summit in October, which the country will host 
— marking the first time the nation will lead 
global talks on the environment. 

That role, together with China’s growing 
global influence — including its vast Belt and 
Road Initiative to build international infra- 
structure — has put a spotlight on its impact 
on, and efforts to preserve, biodiversity. 

“We are familiar with China being part of the 
problem of the global environmental emer- 
gency. For the sake of nature and the people 
living on this planet, there is a need to turn 
China into part of the solution,” says Li Shuo, 
a policy adviser at Greenpeace China in Beijing. 

The gathering on 24-29 February was orig- 
inally planned to take place in China, in the 
city of Kunming, but following the outbreak 
of coronavirus in December, it has been moved 
to Rome. 

The meeting is the second of three rounds of 
talks in which nations decide biodiversity tar- 
gets that will form the basis of a legally binding 
global agreement. That will be signed at the 
15th conference of the parties to the United 
Nations Convention on Biological Diversity 
(CBD) in Kunming in October, and will replace 
the current accord, signed in 2010. 

The CBD global agreements are the main 
mechanism for holding signatories respon- 
sible for protecting biodiversity for the next 
ten years, says Basile van Havre, co-chair of 
the CBD. More than 190 countries have signed 
the treaty. 

The stakes are particularly high this time, 
because countries have largely failed to meet 
the 2020 deadline for the current CBD goals, 
such as preventing species extinctions and 
ensuring that all fish stocks are harvested 
sustainably, says Li. 

Ecosystems are vanishing rapidly, and close 
to one million plant and animal species face 
extinction. If this trend continues, it could 
have big consequences for people and food 
production, says van Havre. Countries have 
failed to meet the current goals partly because 
the targets were vague and difficult to imple- 
ment, and progress hard to track, he says. 

As host, China will take over presidency of 
the CBD and lead negotiations at the October 
summit. The country’s diplomats will have the 
crucial job of nudging the world to agree to 
ambitious goals that can be measured, say 
researchers. At the February meeting, China 
is expected to take a prominent role in nego- 
tiations over amendments to a paper, known 
as the zero draft, that will form the basis of the 
accord to be signed in October. 

China has made significant progress on 
environmental issues at home, which will 
give its diplomats some clout, says Li. Over 
the past decade, the central government has 
established thousands of nature reserves and 
parks, and it is drawing up ecological ‘red lines’ 
to restrict human and industrial activity over 
about one-quarter of the country. And in 2016, 
the conservation status of the giant panda 
(Ailuropoda melanoleuca) was upgraded from 
endangered to vulnerable. “Hosting this con- 
ference is one way of demonstrating that China 
is getting more serious on environmental 
protection,” says Li. 

But China has also had high-profile failures. 
Last December, a group of scientists said the 
giant paddlefish (Psephurus gladius) that 
swam in the waters of the country’s Yangtze 
River was extinct (H. Zhang et al. Sci. Total 
Environ. 710, 136242; 2020). And no one has 
seen a Yangtze River dolphin (Lipotes vexillifer) 
since 2002; the species might also be extinct. 

For China to truly lead when it comes to 
biodiversity, it will need to address its signif- 
icant and rising ecological footprint outside 
its borders, says Aleksandar Rankovic, an 
environmental scientist at the Institute for 
Sustainable Development and International 
Relations in Paris. 




The country imports large volumes of 
environmentally damaging products such as 
palm oil, which contributes to deforestation 
in tropical countries including Indonesia. And 
although trends are hard to estimate, much 
of the world’s illicit wildlife trade is widely 
understood to be driven by Chinese demand 
for traditional medicines and delicacies. 

Li Zhang, a conservation biologist at Beijing 
Normal University, thinks that China’s wild- 
life trade and consumption contributed to 
the current coronavirus outbreak. The virus 
is thought to have jumped to humans from 
animals, possibly pangolins, sold in a market 
in Wuhan, the epicentre of the outbreak. The 
national government has temporarily banned 
wildlife markets. But Li says these measures 
are not enough to prevent the emergence of 
another infectious disease, and is calling for a 
permanent ban on the markets. 

The Yangtze River dolphin (Lipotes vexillifer) is feared to be extinct. 


Infrastructure impacts 


Conservationists are also worried about the 
environmental impact of China’s Belt and Road 
Initiative (BRI), a vast infrastructure-building 
project to connect China to many parts of the 
world that has been likened to a modern Silk 
Road. A 2019 study found that many of the road 
and rail lines planned for southeast Asia pose a 
serious risk to biodiversity hotspots and could 
facilitate wildlife trafficking (A.C. Hughes Con- 
serv. Biol. 33, 883-894; 2019). Belt and Road 
projects, especially in developing countries, 
have not had enough oversight to ensure that 
ecologically fragile regions are protected, says 
study author Alice Hughes, a conservation 
biologist at the Chinese Academy of Sciences 
Xishuangbanna Tropical Botanical Garden. 

The BRI is not aligned with the biodiver- 
sity-conservation approach that research- 
ers hope China will champion as part of the 
Kunming agreement, says Simon Zadek, the 
London-based principal of Project Catalyst, an 
initiative of the United Nations Development 
Programme to spur action towards achieving 
the UN Sustainable Development Goals. 

Many countries are using the Kunming conference 
in October to highlight the environmental 
impact of the BRI, says Li at Greenpeace 
China. “This will be a big thing to watch out 
for, whether there will be an effective policy 
response from China’s side,” he says. 

Hughes says researchers are calling for the 
biodiversity treaty to address the international 
environmental impacts created by all countries 
— not just China — through activities such 
as foreign investment and trade. “International 
development projects are not being subjected 
to any level of scrutiny,” says Hughes. 

But van Havre says that the CBD doesn’t have 
the scope to consider environmental damage 
from foreign-investment projects. It deals only 
with sovereign states and their responsibilities 
within their own borders, he says. 

The negotiations in February will consider 
domestic mechanisms to protect the environ- 
ment, and this year’s zero draft is the first to 
include a specific mention of the need for 
countries receiving foreign investments to 
assess their environmental impact, says van 
Havre. For instance, if China wants to build a 
railway in another country, then that coun- 
try needs to do such an assessment before 
allowing the project to go ahead, he says. 

Li doesn’t think China will address environ- 
mental concerns about the BRI seriously at this 
year’s talks. The officials calling the shots on 
the BRI are not the ones handling the Kunming
conference, he says. “I expect a lot of buzz on
overseas footprint, but I do not expect a lot of
concrete policy progress,” he says. 


Medics check on people with COVID-19 in Jinyintan Hospital in Wuhan, China.


SLEW OF TRIALS LAUNCH 
TO TEST CORONAVIRUS 
TREATMENTS IN CHINA 


HIV drugs, stem cells and traditional Chinese 
medicines are vying to prove their worth. 


By Amy Maxmen 


China has more than 80 running or
pending clinical trials on potential
treatments for COVID-19, the illness
caused by a coronavirus that has so
far killed more than 1,800 people and
infected more than 70,000 across the country,
and for which there is currently no cure.

New drugs are listed beside thousand-year-old
traditional therapies and existing treatments
for other diseases in a public registry of
China’s clinical trials that is growing every day. 
But scientists caution that only carefully con- 
ducted trials will show which measures work. 

Soumya Swaminathan, chief scientist at 
the World Health Organization (WHO), says 
that the agency is drawing up a plan for a
clinical-trial protocol that researchers around the
world could use, and working with scientists to 
help set standards for the trials in China, which 
include as many as 600 people each. 

For example, a person’s stages of recovery or 
decline should be measured in the same way, 
regardless of the treatment being tested, says 
Swaminathan. “We can hopefully bring some 
sort of structure into the whole thing.” 

The WHO’s clinical-trial protocol will 
compare two or three therapies, including
an HIV-drug combination (lopinavir and
ritonavir) and an experimental antiviral called 
remdesivir. 

Researchers in China have begun testing 
these drugs in clinical trials, according to the 
Chinese Clinical Trial Registry, and there is 
already some evidence to suggest they have 
potential to fight the coronavirus. “Getting 
the clinical trials straight is a priority, since 
if we get information on what is working and 
not working, we can benefit patients now,” 
Swaminathan says. 


Animal results 


The two HIV drugs block enzymes that viruses
need to replicate. In animal studies, they have 
reduced levels of the coronaviruses that cause 
severe acute respiratory syndrome (SARS) and 
Middle East respiratory syndrome (MERS)¹.
Remdesivir, a nucleotide analogue made by 
the biotechnology company Gilead in Foster 
City, California, has also had some success 
against coronaviruses in animals². And in January,
researchers reported that one person
in the United States had survived a COVID-19
infection after being treated with remdesivir³.

In the first week of February, two
placebo-controlled trials of remdesivir, set
to include a total of 760 people with COVID-19,
began in China. Those studies should be
completed by the end of April, and remdesivir 
could be approved by Chinese authorities as 
early as May, says Shibo Jiang, a virologist 
at Fudan University in Shanghai. “But the 
epidemic might be gone by then,” he says. 

Researchers in China have also launched 
a few trials that test chloroquine, a malaria 
drug that killed off the new coronavirus 
(recently named SARS-CoV-2) in cell culture⁴.
And scientists are studying whether ster- 
oids diminish inflammation in people with 
severe COVID-19, or cause harm. “It will be 
interesting to see these results,” says Yazdan 
Yazdanpanah, an epidemiologist with France’s 
national health agency, INSERM, in Paris. 
Research clinicians around the world will need 
this information if the outbreak continues to
spread, he adds. 

Another study — a 300-person controlled 
trial — will test serum from COVID-19 survivors.
The same basic idea — that the antibodies one 
person steadily builds up to fight a virus can 
help someone freshly infected to fight it off 
rapidly — has had modest success when used 
to treat other viruses in the past⁵.

Two stem-cell trials are also listed in China’s 
registry. In one, a team at the First Affiliated 
Hospital of Zhejiang University will infuse 
28 people with stem cells derived from men- 
strual blood, and compare results with those 
from people who did not receive the infusions. 
So far, there is minimal evidence indicating 
that stem cells clear coronavirus infections. 
Swaminathan says that the WHO cannot con- 
trol what researchers do, but that the agency 
published guidance on the ethics of running 
trials amid outbreaks in 2016. And it will be 
posting a more accessible brief report on the 
issue soon. 

About 15 trials listed in China’s registry 
expect to enrol a total of more than 
2,000 people in studies on a variety of tradi- 
tional Chinese medicines. One of the largest 
assesses shuanghuanglian, a Chinese herbal 
medicine that contains extracts from the dried 
fruit lianqiao (Forsythiae fructus), which is
purported to have been used to treat infec- 
tions for more than 2,000 years. The trial has 
400 participants, including a control group 
given standard care but not a placebo therapy.

The WHO is working with Chinese scientists 
to standardize the design of all the studies, 
including those on traditional medicines. 
That reflects a controversial move last year, 
in which the organization recognized tradi- 
tional Chinese medicine in its compendium 
of diseases. Critics argued that the WHO’s 
recognition amounted to endorsement, but 
Swaminathan disagrees. She says that the 
move helps to codify medical terminology so 
that herbal remedies can be evaluated with 
the rigour expected of pharmaceutical testing.
“We want a scientific approach to testing
traditional medicine,” she says.


With many therapeutic possibilities and 
limited time, Jiang says the WHO should 
provide advice about which treatments to 
move forward, and which to ditch, as trials 
progress. And he hopes that research on 
better, broader therapies will be continued 
after the outbreak ends. “I worry this will be 
the same situation as during SARS,” he says,
“where the work starts, then stops.”


SCIENTISTS FEAR
CORONAVIRUS SPREAD 
IN VULNERABLE NATIONS 


Concerns are rising about the virus’s potential 
to circulate undetected in Africa and Asia. 


By Smriti Mallapaty 


Infections with the new coronavirus have
now been detected in 25 countries outside
China. But researchers warn that
cases might be going undetected in some
nations that are considered to be at high
risk of an outbreak but are reporting fewer
cases than expected, or none at all.

The possibility of unreported cases of the 
disease, known as COVID-19, is particularly concerning
in countries with weaker health-care
systems, such as some in southeast Asia and 
Africa, which could quickly be overwhelmed 
by a local outbreak, experts say. So far, only one
case has been reported in Africa — in a person
in Egypt — but some countries there, such as 
Nigeria, are at particular risk because of their 
strong business ties to China. 

Researchers have been using flight data to 
create models of the possible spread of the 
virus around the world. One model identified 
30 countries or regions at risk of importing 
the virus on the basis of the large number of 
flights from Wuhan, the outbreak’s epicentre, 
and from other cities in China with many 
travellers from Wuhan. 

Thailand is the country most exposed, 
according to the study, which was published 
on 5 February and used flight data from Feb- 
ruary 2018 (S. Lai et al. Preprint at medRxiv 
http://doi.org/dmr4; 2020). Thirty-five people 
with the infection have been reported there so
far, of whom 23 had been in China. But study 
co-author Shengjie Lai, an epidemiologist 
at the University of Southampton, UK, says 
the model estimates that Thailand probably 
imported 207 cases in the 2 weeks before 
travel into and out of Wuhan was restricted in 
late January. 

Indonesia has not reported a single case so 
far, and yet the country is a popular destination
for Chinese tourists. Lai says it might have
imported as many as 29 cases. Several other 
countries, including Malaysia, Vietnam, 
Cambodia and Australia, have also reported 
fewer cases than the model predicts, he says. 

Although it’s possible that there have truly 
been no cases in Indonesia, infected peo- 
ple might have recovered before they were 
detected, says epidemiologist Andrew Tatem, 
a co-author of the study, also at the University
of Southampton. Undetected cases might also 
be spreading under the radar, he says. 

Despite the predictions, Amin Soebandrio, 
an infectious-disease scientist and chair of 
the Eijkman Institute for Molecular Biology 
in Jakarta, says Indonesia has the capacity to 
detect the virus in people if it arrives. 

But some countries in southeast Asia have 
limited numbers of health-care workers,
hospital beds, support staff and ventilators,
and would struggle to respond to a surge in
cases of the virus, says Richard Coker, a retired
physician based in Bangkok.

The coronavirus responsible for COVID-19.

Tedros Adhanom Ghebreyesus, director- 
general of the World Health Organization 
(WHO), said the agency’s decision to declare the 
outbreak a global health emergency was mainly 
due to concerns that the virus could spread in 
countries with weaker health-care systems. 


What about Africa? 


For that reason, infectious-disease research- 
ers are also worried about the virus spreading 
among people in Africa. A large number of 
Chinese labourers work in Africa, and their 
travel between China and Africa is a possible 
route for transmission, says Marc Lipsitch, an 
epidemiologist at the Harvard T.H. Chan School 
of Public Health in Boston, Massachusetts. 

Another model found that Egypt, Algeria 
and South Africa are the countries in Africa 
that are most at risk of the virus spreading. 
The analysis, published on 7 February, exam- 
ined flights to Africa from Chinese cities that 
had reported infections, but excluded cities 
in Hubei province, where Wuhan is located, 
because of the lockdown that has restricted 
travel from many cities there since late January 
(M. Gilbert et al. Preprint at medRxiv
http://doi.org/dmr5; 2020).

But these three countries also have the 
capacity to respond effectively to an outbreak, 
says Vittoria Colizza, who models infec- 
tious diseases at the Pierre Louis Institute of 
Epidemiology and Public Health in Paris and 
is a co-author of the Africa study.

Colizza is most concerned about seven 
African nations that have a moderate risk of 
importing the virus, but whose weak health- 
care systems, low economic status or unstable 
political situation make them highly 
vulnerable. These are Nigeria, Ethiopia, Sudan, 
Angola, Tanzania, Ghana and Kenya. 

Until two weeks ago, many African nations 
did not have laboratories that could diag- 
nose COVID-19, and samples had to be tested 
abroad. But the situation is changing rapidly, 
says Colizza. Africa has gone from having only 
two labs with the capacity to confirm the virus 
to having at least eight, according to the WHO. 

Three of the newly added labs are in Nigeria, 
says Chikwe Ihekweazu, director-general of the 
Nigeria Centre for Disease Control in Abuja. 

Ihekweazu says Nigeria’s size, the volume of 
travellers it receives and its vibrant economy 
already make it vulnerable to importing an 
infectious disease, and that the country’s strong 
business ties with China pose a further risk. 

Nigeria has ramped up screening of travel- 
lers from China. Ihekweazu says the worst-case 
scenario for the country would be if an infected
person goes undetected and begins to infect 
others. “That is really what keeps me up at 
night,” he says.


POPULAR PREPRINT SITES 
FACE CLOSURE BECAUSE 
OF MONEY TROUBLES 


Repositories such as INA-Rxiv boost regional 
science, but paying for them is proving difficult. 


By Smriti Mallapaty 


The rise of preprint repositories has
helped scientists worldwide to share 
results and get feedback quickly. But 
several platforms that serve research- 
ers in emerging economies are strug- 
gling to raise money to stay afloat. One, which 
hosts research from Indonesia, has decided to 
close because of this funding shortfall.

INA-Rxiv, which was set up in 2017, was one
of the first repositories to host studies from a
particular region. Previous platforms served 
specific disciplines, such as arXiv for physi- 
cal-sciences research. Other region-specific 
repositories followed, including ArabiXiv, AfricArxiv
and IndiaRxiv. These repositories increase
exposure for research from the regions, and 
facilitate collaborations, say their managers. 


The servers are run by local volunteers 
but hosted online by the non-profit Center 
for Open Science (COS) in Charlottes- 
ville, Virginia. The centre’s platform hosts 
26 repositories, including some that are dis- 
cipline-specific. In 2018, the COS told reposi- 
tory managers that it would be charging them 
maintenance fees from 2020. The charges 
start at about US$1,000 a year, and increase 
as repositories’ annual submissions grow. 

The costs can be significant, particularly for 
repositories in emerging economies. Dasapta 
Erwin Irawan, a hydrogeologist at the Bandung
Institute of Technology who helped to set up 
INA-Rxiv, says the repository received more 
than 6,000 submissions between July 2018 
and June 2019, so the fees will come to about 
$25,000 per year, which he cannot afford. 
After unsuccessfully trying to raise money
from the Indonesian government, he has
decided to wind down the service and close 
it, although he has not yet set an end date. 

Juneman Abraham, a social psychologist at 
Bina Nusantara University in Jakarta, says he 
will lose an important source of information 
on the latest Indonesian research when the 
repository closes. 


Long-term survival 


The COS decided to introduce fees so that it 
could sustain its hosting service in the long 
term; running it will cost about $230,000 in 
2020, says Brian Nosek, the centre’s executive 
director. It used to rely on grants from private 
foundations, but they are no longer enough. 
Now the operating costs will be covered by a 
mix of grants and user contributions, he says. 

About half of the repositories have com- 
mitted to paying the fees so far; some are 
managed by organizations that have access 
to grants, whereas others have partnered with 
libraries for funding, says Nosek. The centre is 
committed to helping repositories, including 
INA-Rxiv, find funding, he says. 

But Nosek acknowledges that repositories 
run in emerging economies are most likely to
struggle to raise funds. The centre will be flexi- 
ble about when groups pay this year’s fees, but 
if no money is received, it will freeze services 
so that they can’t accept submissions, he says.


THE NEXT CHAPTER
FOR AFRICAN
GENOMICS


Nigeria is poised to become a hub for 
genetics research, but a few stubborn 
challenges block the way. 

By Amy Maxmen

In the affluent, beach-side neighbourhoods
of Lagos, finance and technology
entrepreneurs mingle with investors at 
art openings and chic restaurants. Now 
biotech is entering the scene. Thirty- 
four-year-old Abasi Ene-Obong has 
been traversing the globe for the past 
six months, trying to draw investors and 
collaborators into a venture called 54Gene. 
Named to reflect the 54 countries in Africa, 
the genetics company aims to build the con- 
tinent’s largest biobank, with backing from 
Silicon Valley venture firms such as Y Com- 
binator and Fifty Years. The first step in that 
effort is a study, launched earlier this month, 
to sequence and analyse the genomes of 
100,000 Nigerians. 

At a trendy African fusion restaurant, 
Ene-Obong is explaining how the company 
can bring precision medicine to Nigeria, and 
generate a profit at the same time. He talks 
about some new investors and partners that 
he’s not able to name publicly, then pulls out 
his phone to show pictures of a property he just
purchased to expand the company’s lab space. 

“My big-picture vision is that we can be a
reason that new drugs are discovered,” Ene- 
Obong says. “I don’t want science for the 
sake of science, I want to do science to solve 
problems.” 

It’s too soon to say whether he will succeed. 
But his ambitions would have been unthink- 
able a decade ago, when most universities 
and hospitals in Nigeria lacked even the most 
basic tools for modern genetics research. Ene- 
Obong, the chief executive of 54Gene, is riding 
a wave of interest and investment in African 
genomics that is coursing through Nigeria. In 
a rural town in the western part of the country,
a microbiologist is constructing a US$3.9-million
genomics centre. And in the capital city of
Abuja, researchers are revamping the National 
Reference Laboratory to analyse DNA from 
200,000 blood samples stored in their new 
biobank. Studying everything from diabetes 
to cholera, these endeavours are designed to 
build the country’s capabilities so that genet- 
ics results from Africa — the publications, pat- 
ents, jobs and any resulting therapies — flow 
back to the continent. 

The rest of the world is interested, too. 
Africa contains much more genetic diversity 
than any other continent because humans 
originated there. This diversity can provide 
insights into human evolution and common 
diseases. Yet fewer than 2% of the genomes 
that have been analysed come from Africans. 
A dearth of molecular-biology research on the 
continent also means that people of African 
descent might not benefit from drugs tailored
to unique genetic variations. Infectious-disease
surveillance also falls short,
meaning that dangerous pathogens could
evade detection until an outbreak is too big
to contain easily.

54Gene aims to create Africa's largest biobank.

But Nigeria’s genetics revolution could just 
as soon sputter as soar. Although the coun- 
try is Africa’s largest economy, its research 
budget languishes at 0.2% of gross domestic 
product (GDP). Biologists therefore need to 
rely on private investment or on funding from 
outside Africa. This threatens continuity: one 
of the largest US grants to Nigerian geneticists, 
through a project known as H3Africa, is set to
expire in two years. There are other challenges. 
Human research in Africa requires copious 
communication and unique ethical consid- 
eration given the vast economic disparities 
and history of exploitation on the continent. 
And a lack of reliable electricity in Nigeria hobbles
research that relies on sub-zero freezers,
sensitive equipment and computing power. 

Yet with a hustle that Nigerians are famous 
for, scientists are pushing ahead. Ene-Obong 
hopes to pursue research through partner- 
ships with pharmaceutical companies, and 
other geneticists are competing for interna- 
tional grants and collaborations, or looking 
to charge for biotech services that are usually 
provided by labs outside Africa. Last Novem- 
ber, Nnaemeka Ndodo, chief molecular bioen- 
gineer at the National Reference Laboratory, 
launched the Nigerian Society of Human 
Genetics in the hope of bringing scientists 
together. “When I look at the horizon it looks 
great — but in Nigeria you can never be sure,” 
he says. 


Building the foundation 


Around 15 years ago, Nigerian geneticist 
Charles Rotimi was feeling dismayed. He was 
enjoying academic success, but would have 
preferred to do so in his home country. He had
left Africa to do cutting-edge research, and he 
was not alone. 

Many Nigerian academics move abroad. 
According to the Migration Policy Institute 
in Washington DC, 29% of Nigerians aged 25 
or older in the United States hold a master’s 
or a doctoral degree, compared with 11% of 
the general US population. 

After Rotimi joined the US National Insti- 
tutes of Health (NIH) in Bethesda, Maryland, in 
2008, he hatched a plan with director Francis 
Collins to drive genetics research in Africa. 
Rotimi wasn’t interested in one-off grants, 
but rather in building a foundation on which 
science could thrive. “The major thing to me 
was to create jobs so that people could do the
work locally,” he says. In 2010, the NIH and 
Wellcome, a biomedical charity in London, 
announced the H3Africa, or Human Heredity 
and Health in Africa, project. It’s become a 
$150-million, 10-year initiative that supports 
institutes in 12 African countries. The proof of 
its success will be not in the number of papers
published, but rather in the number of African
investigators able to charge ahead after the
grant ends in 2022.

54Gene chief executive Abasi Ene-Obong is preparing to make Nigeria a genetics powerhouse.

For that to happen, H3Africa researchers 
realized they needed to revise research regu- 
lations and procedures for gaining the public’s 
trust. So rather than just collecting blood and 
leaving — the approach disparagingly referred 
to as helicopter research — many investigators
on the team have devoted time to adapting 
studies for the African context. 

For example, when Mayowa Owolabi, a 
neurologist at the University of Ibadan, Nige- 
ria, was recruiting healthy controls for his 
H3Africa study on the genetics of stroke, his 
team discovered that many people had alarm- 
ingly high blood pressure and didn’t know it. 
Nigeria has one of the world’s highest stroke 
rates, and Owolabi realized that communities 
needed medical information and basic care 
more urgently than genetics. So he extended 
his study to include education on exercise, 
smoking and diet. And, on finding that many 
people had never heard of genetics, the team 
attempted to explain the concept. 

This is a continuing process. One morning 
last November — seven years into the project
— a community leader in Ibadan visited
Owolabi’s private clinic. He said tensions had 
mounted because people who had partici- 
pated in the study wanted to know the results 
of their genetic tests. Owolabi replied that they 
were still searching for genetic markers that 
would reveal a person’s risk of stroke, and that 
it might be many years until any were found. 
“But it’s a heart-warming question,” he says, 
“because if the people demand a test, it means
the study is the right thing to do.”

Discovering the genetic underpinnings of
stroke is also complicated by the fact that it, 
like many non-communicable disorders, is 
caused by a blend of biological and environ- 
mental factors. Owolabi flips through a blue 
booklet of questions answered by 9,000 par- 
ticipants so far. It asks about everything from 
family medical history to level of education. 
Insights are buried in the answers, even with- 
out DNA data: the team found, for instance, 
that young Nigerians and Ghanaians who eat 
green leafy vegetables every day have fewer 
strokes¹. And that’s just the beginning. “You
see the amount of data we have accrued,” he 
says. “I don’t think we have used even 3% of it, 
so we need to get more funding to keep the 
work going.” 

Owolabi’s team is now applying for new 
grants from the NIH, Wellcome and other inter- 
national donors to sustain the work after the 
H3Africa grant ends. And to make themselves 
more appealing to collaborators and donors, 
they’re increasing the amount of work they 
can do in Ibadan. Until last year, most of the
genetic analyses were conducted at the University
of Alabama in Tuscaloosa. But last June,
the University of Ibadan installed a computer
cluster to serve the project, and three young 
bioinformaticians are now crunching the data. 
“The big-data business is happening now,” says 
Adigun Taiwo Olufisayo, a doctoral student 
concentrating on bioinformatics. But he also 
admits that funds are tight. 

Last year, other graduate students on the 
team began to extract DNA from samples 
so that they can scour it for genetic variants
linked to strokes. In a room the size of a
cupboard, a technician labels tubes beside a
freezer. Coker Motunrayo, a doctoral student 
studying memory loss after strokes, sits on 
the counter-top because there’s not enough 
space for a chair. She insists that the H3Africa 
project is a success, even though their genetics
work has just started. “Compare this to where 
we were five years ago, and you'd be stunned,” 
she says. 


On the cusp


Perhaps the most advanced genomics facility 
in West Africa right now is located in Ede in 
southwestern Nigeria. At Redeemer’s University,
a private institution founded by a Nigerian
megachurch, microbiologist Christian Happi 
is building an empire. Construction teams 
are busy creating a $3.9-million home for the 
African Centre of Excellence for Genomics of 
Infectious Diseases. 

Happi strides across a veranda, and into a
series of rooms that will become a high-level 
biosafety laboratory suitable for working on 
Ebola and other dangerous pathogens. Another
small room nearby will house a NovaSeq 6000 
machine made by Illumina in San Diego, Cali- 
fornia, a multimillion-dollar piece of equip- 
ment that can sequence an entire human 
genome in less than 12 hours. It’s the first of 
that model on the continent, says Happi, and 
it positions his centre, and Africa, “to become a
player in the field of precision medicine”. Then 
he announces that Herman Miller furniture is 
on the way. If it’s good enough for his collaborators
at the Broad Institute of MIT and Harvard in
Cambridge, Massachusetts, he adds, it is good
enough for his team. 

Happi plans to move his lab into the facility 
in a few months. But the team is already doing
advanced work on emerging outbreaks. At a
small desk, one of Happi’s graduate students, 
Judith Oguzie, stares at an interactive pie chart 
on her laptop. The chart displays all of the 
genetic sequences recovered from a blood 
sample shipped to the lab from a hospital as 
part of a countrywide effort to learn which 
microbes are infecting people with fevers. Typ- 
ically, doctors test the patients for the disease 
they think is most likely, such as malaria, but 
this means other infections can be missed. For 
example, the sequences Oguzie is looking at 
belong to the Plasmodium parasites that cause 
malaria, the virus that causes the deadly Lassa 
fever, and human papillomavirus. 

Oguzie says that a few years ago, she was 
processing samples from a hospital in which 
people were dying because their fevers had 
confounded diagnosis. With the help of 
next-generation sequencing, she found that 
they were infected with the virus that causes 
yellow fever. She showed Happi the results,
and he reported the news to the Nigeria Cen- 
tre for Disease Control (NCDC), which rapidly 
launched a vaccination campaign. 

This was exactly what Oguzie had wanted
out of science. “I’m happy when I solve problems
that have to do with life,” she says. She
worked hard throughout university in Borno, 
even after the terrorist organization Boko 
Haram started attacking the northern state. 
She heard bomb blasts during lectures and 
knew people who were shot. 

Nevertheless, Oguzie finished her degree in 
2011. She had a son a few years later and wanted
to stay with her family in Nigeria, but she strug- 
gled to find a graduate institution that would 
allow her to excel in genetics. She had already 
begun searching for scholarships at univer- 
sities in the United Kingdom, Australia and 
the United States when she found out about 
Happi’s lab. 

Happi had been persuaded to return to 
Nigeria from Harvard School of Public Health 
in Boston. The vice-chancellor of Redeemer’s 
at the time was an influential virologist named 
Oyewale Tomori. He offered Happi a lucrative
start-up package to build an environment 
similar to the one he had become used to at 
Harvard. 

Soon after he joined the university, Happi 
won H3Africa grants totalling $6.8 million
that have led to some impressive projects. For
example, he and his collaborators mapped the
spread of infections in the country’s largest
outbreak of Lassa virus². He also won World
Bank funding for an African centre of genomics.
The grant is paid out incrementally on the
basis of milestones such as training graduate
students or researchers from another African
country. So far, his centre has earned more 
than $9 million. 

He says the money means that he can offer 
experienced researchers salaries that stop 
them from leaving Nigeria and keep his lab 
up to date with the fast-moving field. Happi 
invites a rotating cast of top infectious-disease 
scientists from the United States to collabo- 
rate with his team in Ede. “I want to build a 
place where we can work together,” he says, 
“not a place from where things are taken away.”

But in an office beside Happi’s, geneticist 
Onikepe Folarin says she has no time to con- 
duct research because she’s constantly writing 
grant proposals, and reporting back to donors 
on various milestones. To lessen their reliance 
on grants, she and Happi plan to start selling 
genomics services. 

At the moment, African researchers pay 
a lot to ship samples and reagents to and 
from China and the United States, and these 
items often get held up at ports. But with his
sequencing equipment and machines to produce
important reagents such as primers,
Happi hopes to provide a commercial service
to other researchers on the continent — and 
use the money to fund his research. 


Disruptors 


As the son of a plant geneticist, 54Gene head
Ene-Obong developed a certain angst about 
the fits and starts of international grants. So 
after earning a PhD in genetics, he studied busi- 
ness with the aim of driving research sustain- 
ably. One idea he has for 54Gene is to charge 
drug-development firms for access to the 
genetic data in the company’s biobank. This 
model has proved successful elsewhere. For 
example, last year, the UK Biobank received 
$120 million from 4 pharmaceutical giants 
for access to information on 125,000 people. 

54Gene won't say how it is financing its study
to analyse 100,000 genomes from Nigerians, 
but it has gained the backing of physicians 
from 17 hospitals across the country, who will 
send blood samples from consenting people 
with chronic diseases such as cancer, diabetes
and Alzheimer’s disease. 

But as the first for-profit genetics endeavour
in Nigeria, 54Gene must navigate uncharted 
ethical territory. People could feel cheated 
if they donate samples to research and then 
learn that the company turned a profit while 
they struggle to afford health care. Concerns 
about being taken advantage of loom large 
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Onikepe Folarin and Christian Happi stand in front of a soon-to-be completed genomics centre for studying infectious disease in Nigeria. 
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A perennial concern, Nigeria’s underpowered infrastructure frustrates technology firms. 


in Nigeria — and in Africa more generally — 
because of a history of the continent being 
exploited for everything from slavery to 
diamonds. As Anthony Ahumibe, the senior 
laboratory adviser at the NCDC, says: “Blood 
is a resource, whether it’s inside humans or 
outside.” 

The concerns are well founded. Last year, 
for example, the Sanger Institute in Hinxton, 
UK, came under fire for licensing a gene chip 
based on African genome data to US biotech- 
nology company Thermo Fisher, which was 
planning to manufacture the chip for a profit. 
This infuriated both the African researchers 
who had collaborated with the British team 
and the Ugandan study participants, who had 
not consented to the deal. 

Seeing the potential for disaster, Aminu 
Yakubu, a bioethicist who helped revise Nige- 
ria’s regulations at the start of the H3Africa 
projects, offered to join 54Gene last year to 
help the company come up with solutions. “I 
understand why people will be sceptical, so we 
will be as transparent as possible, and sensitive 
to concerns about exploitation,” he says. He 
and Ene-Obong are devising ways to give back 
to the public even before genetic discoveries 
are made. For example, they might donate 
dialysis machines to participating hospitals 
that lack them. “We are not just doing this to 
make money,” says Ene-Obong. “As a private 
company, we need money to operate, but my 
goal is to study African genetics and translate 
the insights into products that help people.” 


The barriers 

Unlike their younger colleagues, some estab- 
lished Nigerian researchers hesitate to cel- 
ebrate the country’s inarguable growth in 
genomics because they see obstacles in the 


354 | Nature | Vol 578 | 20 February 2020 


path ahead. One of the biggest challenges is the 
lack of national funding. In 2016, it seemed that 
Nigeria’s government was realizing the impor- 
tance of research when it approved a measure 
to commit 1% of its GDP to science and technology. 
That would have amounted to $3.8 billion 
last year, but the money never materialized, 
and the research budget remains at about $750 
million annually — the total across all fields. 
Tomori compares this situation with that 
in China — another middle-income country. 
A decade ago, China’s government plied the 
field of genetics with incentives such as tax 
exemptions and housing for scientists, and it 
put 2% of its GDP into research. Those investments 
have paid off; in 2018, China surpassed 
Europe in biotech investment. 

And because the Nigerian government does 
not fund much science, it has limited power to 
set research agendas. That could stunt genet- 
ics projects because the most powerful stud- 
ies stem from long-term national initiatives, 
such as the UK Biobank and the China Kadoorie 
Biobank, says Prabhat Jha, an epidemiologist 
at the University of Toronto in Canada. Nige- 
ria does have a few large biobanks, generally 
attached to specific research projects — and 
54Gene’s would add to that, but Jha warns that 
it’s often difficult to cobble together samples 
from disparate studies because the data were 
collected with different aims. Creating a unified 
genomics initiative should be a priority, he 
says. “If there were good prospective studies 
under way in Africa,” he adds, “we could really 
start to understand the key determinants of 
diseases and deaths there.” 

Even more basic problems stand in the way 
of success, not least the lack of a reliable elec- 
trical grid. “Until the government puts in basic 
infrastructure, we cannot move forward,” says 
Tomori. In the meantime, institutes and companies 
are spending a huge portion of their 
budgets on back-up generators, diesel fuel and 
solar panels. According to a report released 
last year by the International Monetary Fund, 
Nigeria’s inadequate electricity supply costs 
the country about $29 billion per year’. And 
in a survey by the Center for Global Development, 
Nigeria’s booming tech sector named 
electricity as its number-one constraint’. 

To change the status quo, Tomori says, his 
Nigerian colleagues must persuade their lead- 
ers and the public that investments in science 
matter. “If we sit in our labs doing the same 
things, the situation will not improve,” Tomori 
says. “We need to get out of our test tubes and 
talk about these issues.” 

But the director of genomics research at the 
Nigerian Ministry of Science and Technology 
in Abuja, Oyekanmi Nash, argues that gov- 
ernment funding will flow more freely once 
science starts to deliver tangible benefits. He 
credits H3Africa with triggering the first steps. 
Now, he says, it’s up to researchers to build on 
the effort and show how their science helps. 
Nash joined 54Gene’s initiative to sequence 
100,000 genomes because of the start-up’s 
promise to translate genetics results into med- 
icines. “Once we become strong enough,” he 
says, “the government will listen.” 

It’s a tough bet to make, especially given that 
Nigeria’s post-recession economy remains 
sluggish. But the country’s younger geneti- 
cists don’t really have an option outside of 
optimism. “It’s not been easy,” says Ndodo. 
“Most of us have worked until the middle of the 
night, taken out loans to get training outside 
[Nigeria], and then come back to change the 
system.” But, he says, scientists are on firmer 
ground than their predecessors. And they’re 
driven. “No one else will tell our story,” Ndodo 
says. “No one else will do research that targets 
our own interests.” 


Amy Maxmen writes for Nature from Oakland, 
California. 
Monuments to resilience or collapse? The 800-year-old statues of Easter Island. 


Panicking about societal collapse? 
Plunder the bookshelves 


As civilization seems to be lurching towards a cliff edge, historical case studies are 
giving way to big data in authors’ search for understanding. By Laura Spinney. 


In case you missed it, the end is nigh. Ever 
since Jared Diamond published his hugely 
popular 2005 work Collapse, books on 
the same theme have been arriving with 
the frequency of palace coups in the late 
Roman Empire. Clearly, their authors are 
responding to a universal preoccupation with 
climate change, as well as to growing financial 
and political instability and a sense that civilization 
is lurching towards a cliff edge. Mention 


is also made of how big-data tools are shed- 
ding new light on historical questions. But do 
these books have anything useful to share? Any 


“What if collapse could 
usher in not only a renewed 
world, but a better one?” 


actionable points besides that on my coffee 
mug: “Now panic and freak out”? 

The newest is Before the Collapse. In it, 
energy specialist Ugo Bardi urges us not to 
resist collapse, which is how the Universe 
tries “to get rid of the old to make space for 
the new”. Similarly, Diamond’s 2019 book 
Upheaval suggested that a collapse is an 
opportunity for self-appraisal, after which a 
society can use its ingenuity to find solutions. 
Both writers seem to accept that collapse 
is inevitable, but they take very different 
approaches to analysing it. Diamond zooms 
in to glean lessons from historical case stud- 
ies; Bardi zooms out to view societies as com- 
plex dynamic systems that behave cyclically. 
Numerous books published in the past few 
decades chart how research has shifted from 
Diamond's approach to Bardi’s. 


Robust debate 


Questioning Collapse, a 2009 collection 
of essays edited by archaeologists Patricia 
McAnany and Norman Yoffee, took Diamond 
to task for cherry-picking to spin a good yarn, 
for example in blaming such iconic societal 
failures as the population crash of Easter 
Island on its people’s destruction of their own 
environment. The story is not so simple, the 
authors argue. The Indigenous Rapa Nui soci- 
ety weathered a string of environmental crises 
— very few of its own making — yet thrived until 
the first Europeans arrived. Likewise, is it rea- 
sonable to claim that Mayan society collapsed 
around the ninth century, given that seven 
million people living in and around Central 
America speak Mayan languages today? These 
cases might be better viewed, say McAnany 
and Yoffee, as lessons in resilience. 

Scholars have long warned against peering 
down the ‘retrospectoscope’ at apparently 
neat examples of what not to do. In his 
influential 1988 The Collapse of Complex 
Societies, archaeologist Joseph Tainter argues 
that collapse — in the sense of the complete 
obliteration of a political system and its 
associated culture — is rare. Even the worst 
cases are usually better described as rapid loss 
of complexity, with remnants of the old society 
living on in what rises from the ashes. After the 
‘fall’ of Rome in the fifth century, for example, 
successor states took more than 1,000 years 


Collapse: How Societies Choose 
to Fail or Succeed 
Jared Diamond Viking (2005) 


Before the Collapse: A Guide to the Other Side 
of Growth 
Ugo Bardi Springer (2020) 


Upheaval: Turning Points for Nations in Crisis 
Jared Diamond Little Brown (2019) 


Questioning Collapse: Human Resilience, 
Ecological Vulnerability, and the Aftermath of 
Empire 

Edited by Patricia A. McAnany & Norman Yoffee 
Cambridge Univ. Press (2009) 


The Collapse of Complex Societies 
Joseph Tainter Cambridge Univ. Press (1988) 


Understanding Collapse: Ancient History and 
Modern Myths 
Guy D. Middleton Cambridge Univ. Press (2017) 


Why the West Rules — for Now: The Patterns of 
History, and What They Reveal About the Future 
Ian Morris Farrar, Straus and Giroux (2010) 


War and Peace and War: The Rise and Fall of 
Empires 
Peter Turchin Pi (2006) 


Revolution and Rebellion in the 
Early Modern World 
Jack Goldstone Univ. California Press (1991) 

to achieve comparable economic and technological 
sophistication, but were always 
recognizably the empire’s offspring. 

Nevertheless, societies do go through rocky 
patches, from which some emerge trans- 
formed. It’s not surprising that scholars should 
want to understand why. In his thoughtful 
Understanding Collapse (2017), archaeologist 
Guy Middleton surveys more than 40 theo- 
ries of collapse — including Diamond’s — and 
concludes that the cause is almost always 
identified as external to the society. Perennial 
favourites include climate change and barbar- 
ian invasions — or, in the Hollywood version, 
alien lizards. The theories say more about the 
theorists and their times, Middleton argues, 
than about the true causes of collapse. 


Under strain 


The pressing question, Tainter told a workshop 
on collapse at Princeton University in New 
Jersey last April, is why can a society with- 
stand repeated external blows — until one 
day it cannot? For him, a society fails when 
it is no longer able to adapt to diminishing 
returns on innovation: when it can’t afford 
the bureaucracy required to run it, say. In Why 
the West Rules — For Now (2010), historian Ian 
Morris proposes a twist on this, namely that 
the key to a society’s success lies in its ability 
to capture energy — by extracting it from the 
ground, for example, or from nuclear fission 
once fossil fuels have run out. By contrast, 
Peter Turchin, author of the 2006 War and 
Peace and War, suggests that collapse is what 
happens when a society stops being able to 
deal with the strains caused by population 
growth, leading to inequality and strife. 
Turchin has been compared to Hari Seldon, 
science-fiction writer Isaac Asimov's “psycho- 
historian”, who studies the past to statistically 
predict the future. He belongs to a new breed of 
scientific historian taking a big-data approach, 
and argues — controversially — that societal 
spasms are cyclic. This idea itself comes and 
goes: the ancient Greeks took the cyclic nature 
of history for granted, but it has been unfash- 
ionable since the Enlightenment. Today, we 
tend to have a linear concept of progress, in 
which life generally improves for most people 
over the long term. Works such as Turchin’s 
see this trend as superimposed on an inherent 
cyclicity in the evolution of societies. 


Reboot cycle 


This raises the question of whether collapse 
is essential to renewal. Without winter, can 
you have spring? Bardi says no. Whether 
you think this good or bad depends partly 
on your point of view. The mass extinction 
66 million years ago was bad for dinosaurs, 
but good for mammals, sociologist Miguel 
Centeno observed at the Princeton workshop, 
which he convened. But if collapse could usher 
in not only a renewed world, but a better one, 
shouldn’t we dinosaurs embrace it? 

For Turchin and Jack Goldstone — on whose 
work on the demographic forces shaping his- 
tory Turchin builds — this is good advice only 
if you understand what causes collapse. Then 
it might be possible to make the transition less 
violent or disruptive. Goldstone rigorously 
dissected upheaval in the sixteenth to the 
nineteenth centuries in his 1991 book Revolu- 
tion and Rebellion in the Early Modern World. 
This convinced him that revolution is an inappropriate 
response to societal tensions, usually 
leading to tyranny. Solutions have come 
instead from deep, meaningful reform. Yet 
the idea that revolution removes obstacles 
to progress has “deluded literally billions of 
people”, he argues. 

An interdisciplinary community of 
researchers is now searching for patterns 
that have defined collapse throughout history, 
to determine what might be an appropriate 
response. If we can’t and shouldn’t prevent 
a future crisis, could we at least soften it — 
perhaps with the help of new technologies — 
so that renewal happens, but less is lost and 
fewer people suffer? Even if the mind-boggling 
complexity of human societies makes this a 
pipe dream, as some argue, it seems a sounder 
approach than sparring over case studies that 
might not have constituted collapse at all. 
Speaking as a dinosaur, whose only alternative 
is to panic and freak out, I'll take it. 


Laura Spinney is a science writer based in Paris. 
Her most recent book is Pale Rider: The Spanish 
Flu of 1918 and How it Changed the World. 
e-mail: lfspinney@gmail.com 
Cured 

Jeffrey Rediger Flatiron (2020) 

An experienced physician who is also a skilled, driven and 
compassionate writer is a winning combination. This pioneering book 
by psychiatrist Jeffrey Rediger analyses unexplained spontaneous 
recoveries from potentially fatal medical conditions, including 
cancer. From interviewing patients over nearly two decades, Rediger 
concludes that each recovery was “unique” and only partially 
explicable, but that all provide evidence of “a powerful link” between 
our identities and our immune systems. 
Disaster by Choice 

Ilan Kelman Oxford Univ. Press (2020) 

Human choices cause disasters, but can also prevent them, argues 
Ilan Kelman in this grimly informative history. A specialist in disasters 
and health, he surveys earthquakes, epidemics, floods and more in 

a range of countries. Thus, in 2010, a magnitude-7.1 earthquake near 
Christchurch, New Zealand, caused not a single death. The same 
year, a magnitude-7.0 quake in Haiti caused at least 100,000 fatalities 
and a cholera outbreak — because of poor buildings and health care. 
Scientific foresight and political will are always key to resilience. 


Under the Stars 

Matt Gaw Elliott & Thompson (2020) 

The Milky Way is invisible to 77% of today’s UK population because of 
artificial light, notes naturalist and journalist Matt Gaw: “Many adults 
and children, my own included, have never seen it.” Such thoughts 
inspired this poetically written but scientifically grounded study of 
darkness and its effect on humans and wildlife. Gaw describes night 
wanderings on English beaches, across Dartmoor and in central 
London. On the Scottish island Coll, a Dark Sky Community without a 
single street light, his children were entranced by the stars. 
Stealth 

Peter Westwick Oxford Univ. Press (2020) 

In 1961, Dwight Eisenhower warned in his last address as US president 
that the “military-industrial complex” must be checked for the sake of 
“security and liberty”. Historian Peter Westwick is more positive in his 
incisive narrative of the top-secret 1970s invention and construction 
of the stealth plane F-117. Nearly invisible to Soviet-designed radar, it 
was used to crucial effect in the 1991 Gulf War. Westwick argues that it 
offered an alternative to nuclear weapons, but admits that “to defend 
American liberties, aerospace engineers gave up civil liberties”. 



Beyond Global Warming 

Syukuro Manabe & Anthony J. Broccoli Princeton Univ. Press (2020) 
The first global climate model, developed in 1896 by chemist Svante 
Arrhenius, included the warming effect of atmospheric carbon 
dioxide. In the 1960s, meteorologist Syukuro Manabe pioneered 
computer simulation of climate change. Manabe’s book written with 
atmospheric scientist Anthony Broccoli has evolved from his lecture 
notes, with chapters on, for example, general circulation models. 
Although technical, it should prove useful to those wishing to 
understand global warming’s future impact. Andrew Robinson 
Louis Nirenberg 


(1925-2020) 


Mathematician who transformed the study of partial differential equations. 


After the Second World War, 
mathematics in the United States 
flourished owing to a convergence of 
interests. Mathematicians had shown 
their worth to military and indus- 
try patrons, who underwrote far-reaching 
empires of theories and people, including the 
consummate problem-solver Louis Nirenberg. 

One of the world’s most cited and 
productive mathematicians, Nirenberg was 
also among the most collaborative. His work 
continued to make waves until he was well 
into his eighties, and reshaped how mathe- 
maticians understand and study dynamical 
systems, from cells to markets. Winning the 
2015 Abel Prize (shared with John Nash, made 
famous by the 2001 film A Beautiful Mind) was 
just a bookend to a féted career. He died on 
26 January, aged 94. 

Nirenberg spent an illustrious seven 
decades at New York University (NYU), a 
realization of the discipline’s post-war entan- 
glements and their scholarly rewards. He 
joyfully nurtured people and ideas, skating 
above emerging distinctions between pure 
and applied mathematics. 

Nirenberg transformed the field of partial 
differential equations (PDE), which explores 
what can be known about mathematical func- 
tions from studying how their variations along 
different dimensions relate to each other. 
Emerging from eighteenth-century math- 
ematical physics, PDE became a centrepiece of 
a vast range of theoretical and applied subjects, 
from telecommunications and nuclear phys- 
ics to debates about the nature of numbers. 
One famous and still-unresolved question in 
which Nirenberg’s insights have been signifi- 
cant asks whether the equations governing the 
movement of water from a given initial state are 
always compatible with a smooth flow. 

A virtuoso of approximation, Nirenberg was 
renowned for manipulating inequalities that 
govern the properties of unknown functions. 
Fellow mathematicians found his perspectives 
and methods strikingly lucid. His works cata- 
lysed large bodies of research, from general 
relativity to biology. 

Raised in a Yiddish-speaking family in 
Montreal, Canada, Nirenberg acquired a taste 
for mathematical puzzles from his Hebrew 
tutor. After completing his undergraduate 
degree in mathematics and physics at McGill 
University in Montreal in 1945, Nirenberg 
joined his friend Sarah Courant at the nearby 


National Research Council, contributing to 
research on atomic weapons. On the advice 
of Sarah’s father-in-law, Richard Courant, a 
leading mathematician at NYU, Nirenberg did 
a master’s in mathematics at the university. 
He remained there for the rest of his career, 
heading the Courant Institute of Mathematical 
Sciences from 1970 to 1972. 

Nirenberg trained with a who’s who of 
twentieth-century mathematics, including 
his PhD supervisor James Stoker and his men- 
tor Kurt Friedrichs as well as visiting scholars 
from across Europe, the Soviet Union and the 
Americas. With fellow students Peter Lax and 
Cathleen Morawetz, he climbed the ranks to 
professor. 

Courant had courted federal contracts and 
support during the Second World War to lay 
the groundwork for his institute. Expansive 
budgets from sources including the Office 
of Naval Research supported Nirenberg’s 
research on elliptic equations (with appli- 
cations from fluid dynamics to finance) and 
pseudo-differential operators (a foundation 
for an enormous variety of approaches in 
modern physics). One-quarter of his publica- 
tions, including his first four in 1953, were in 
the institute’s own journal, Communications 
on Pure and Applied Mathematics. 

Nirenberg considered the world’s 
mathematicians to be “one big family”, and 
found inspiration in visiting and hosting 
colleagues from around the world. His first 
extended overseas research trip, in 1951-52, 
took him to Zurich, Switzerland, where he 
wrote up results from his thesis and attended 
lectures from stars of Courant’s generation. In 
1963, he took part in a landmark symposium on 
PDE in Novosibirsk, which helped to redefine 
the relationship between the Soviet Union 
and the United States. There, he forged close 
friendships in an environment he compared to 
a voyage at sea. A later geopolitically signifi- 
cant trip took him to China toward the end of 
the Cultural Revolution. After being assigned 
a PhD thesis in Italian as the subject for a term 
paper during his graduate studies, he developed 
a lifelong affinity for Italy. 

Nirenberg was known for using methods in 
their most fruitful generality. “I have made a 
living off the maximum principle,” he quipped, 
referring to a fundamental technique for 
establishing inequalities in PDE. He demonstrated 
its versatile potential to researchers in 
many fields. As a young man, he had worried 
about his ability to formulate original prob- 
lems. Yet Nirenberg gained a reputation for 
his exceptional insight and taste as a poser of 
problems that stretched the limits of research 
in mathematics and beyond. 

His career awards included the first Crafoord 
Prize in 1982 and the first Chern Medal of the 
International Mathematical Union in 2010. 
Although he knew Nash, their Abel Prize rec- 
ognized PDE work from separate parts of the 
firmament. 

A famously congenial collaborator, 
Nirenberg co-authored papers to an extent 
unusual in mathematics. Some collaborations 
took place entirely by post, including the only 
work he published with his lifelong colleague 
Lax, conducted while Nirenberg was in Japan. 
Other collaborations — including with the 
46 doctoral students he supervised — involved 
extended dialogues in front of a blackboard or 
while walking to a restaurant, as he digested 
new ideas in company. 


Brit Shields works on the cultural history 

of twentieth-century mathematics; she is 
senior lecturer in the School of Engineering 
and Applied Science at the University of 
Pennsylvania in Philadelphia and a part-time 
programme administrator at the Courant 
Institute of Mathematical Sciences, where 
she knew Nirenberg. Michael J. Barany 

is lecturer in the history of science at the 
University of Edinburgh, UK, where he 
studies the global history and culture of 
modern mathematics. Both had interviewed 
Nirenberg for their research. 

e-mails: bshields@seas.upenn.edu; 
m.barany@ed.ac.uk 
Setting the agenda in research 
Coral in a mangrove swamp in the Raja Ampat Islands, Indonesia. 


Set a global target 
for ecosystems 


James E. M. Watson, David A. Keith, Bernardo B. N. Strassburg, Oscar Venter, Brooke Williams & Emily Nicholson 


The conservation community 
must be able to track 
countries’ progress in 
protecting wetlands, reefs, 
forests and more. 
Next week, representatives of more than 
190 nations are gathering in Rome to 
discuss how to halt the biodiversity 
crisis during this decade and beyond. 

Since 2010, targets for conserving 
species have shaped policy and galvanized 
efforts to halt species loss worldwide, as part 
of the Convention on Biological Diversity 
(CBD; see www.cbd.int/sp/targets). Yet no 
such targets exist for ecosystems — despite the 
wealth of evidence showing that their health 
and functions are essential to the processes 
that maintain all life’. 

Targets that are specific, measurable, 
attainable, relevant and timely (SMART) are 
central to project planning and have proved 
to be effective in policies that seek to address 
global problems. For example, during the 
1980s, a group of 20 nations agreed to set various 
limits on the production and consumption 




of chlorofluorocarbons. This helped to guide 
the phase-out of these substances under the 
Montreal Protocol, which came into effect in 
1989 (ref. 2). 

It is now possible to establish a SMART 
target for ecosystems, as well as metrics to 
track progress in meeting that goal. Nations 
are no longer limited by a lack of knowledge or 
methods when it comes to ecosystem mapping 
and assessment (see ‘Under pressure’). What’s 
more, they can use a proven and standardized 
approach for ecosystem risk assessment: 
the Red List of Ecosystems protocol, which 
was adopted by the International Union for 
Conservation of Nature (IUCN) in 2014. 

We urge those attending next week’s 
meeting to place an ecosystem-based goal 
and target alongside species-based ones in 
their discussions. Nations have a chance to 
ensure that all of the world’s remaining intact 
ecosystems are retained by 2030, that over- 
all ecosystem area and integrity increase by 
2050, and that all that fall below a level of deg- 
radation defined by the Red List of Ecosystems 
protocol are restored. 

The ratification of an international target 
will compel governments to act. This is the 
only way to halt the decline of ecosystems. 


Species and ecosystems targets 


In 2010, the 193 nations that were parties to the 
CBD agreed to work together to prevent the 
extinction of known threatened species and 
improve their conservation status by 2020. 
They did this by ratifying Target 12 of the CBD 
2011-20 strategic plan for biodiversity (see 
www.cbd.int/sp/targets). 

Actions taken because of this and previous 
CBD targets have reduced the risk of extinc- 
tion for many species, although direct links 
are hard to prove. For example, conservation 
efforts over the past 30 years have helped to 
cut the extinction rate of endangered birds 
by at least 40%, according to one analysis®. 
Previously endangered populations that 
are now growing include the Seychelles 
magpie-robin (Copsychus sechellarum) 
and a Brazilian parrot called Lear’s macaw 
(Anodorhynchus leari). 

Over the past decade, nations have been 
identifying and protecting the marine, ter- 
restrial and freshwater sites that are of inter- 
national importance to the conservation of 
vulnerable species. More than 16,000 of these 
‘key biodiversity areas’ have now been identi- 
fied worldwide (see go.nature.com/2xdtqb8). 
Government reports submitted to the CBD 
indicate that such areas are increasingly 
being protected*. One example is Itombwe 
Natural Reserve in the Democratic Republic 
of the Congo, which was formally established 
in 2016 to conserve several rare species, 
including the enigmatic Itombwe puddle frog 
(Phrynobatrachus sp.). 

Such species-focused conservation activi- 
ties are crucial. But they are not sufficient to 
sustain biodiversity and the benefits of nature 
to humanity. 

Ecosystems, from the boreal forest and wet- 
lands to coral reefs and mangroves, are more 
than the total of the plants and animals living 
in them®. Complex interactions between bio- 
logical and physical systems drive processes 
that sustain all life. This includes production 
of clean water, regulation of air quality and 
climate through carbon sequestration and 
storage, soil formation, pollination and the 


“Nations have a chance to 
ensure that all of the world’s 
remaining intact ecosystems 
are retained by 2030.” 


production of food and wood for houses!. 
Indeed, natural systems are key to dealing with 
the effects of climate change, as highlighted 
by a 2019 study’. It estimated that, between 
2000 and 2013, the impact on carbon levels 
of losing intact tropical forests (including 
indirect effects such as reduced biodiversity 
and increased selective logging) might be six 
times greater than was originally proposed’. 
Thanks to substantial advances in mapping 
and monitoring, scientists can now diagnose 
ecosystems’ defining features and the 
processes that threaten them*’. Take the 
demise of tidal flats revealed by satellite tech- 
nology. Such mapping showed that coastal 
development and sea-level rise destroyed 16% 
of these ecosystems between 1984 and 2016. 
This has reduced storm protection and food 
provision for billions of people’. Remote sens- 
ing is similarly monitoring tropical forests’, 
ice cover”, coral reefs" and mangroves”. For 
instance, at least 12% of the world’s mangroves 
were lost between 1996 and 2010 because of 
human activities”. 

Pivotal to these efforts has been the 
development of the Red List of Ecosystems 
protocol, a set of criteria for identifying ecosystems 
that are most at risk of collapse”. It lays 
out how to define and map ecosystems, and 
enables systematic risk assessment using an 
array of indicators of extent and degradation. 

So far, the Red List criteria have been used 
to assess more than 2,800 ecosystems in 
100 countries across all continents”; 45% 
of those systems were found to be at risk of 
collapse (D.A.K., unpublished observation). 
These efforts could serve as a starting point 
for work towards an international target for 
conserving ecosystems. 

Ecosystem-level conservation is already 
affecting decisions on resource use and 
management made by national govern- 
ments, non-governmental organizations and 
industry’. For example, a 2017 assessment 
of ecosystems in Colombia — Amazon 
rainforests, tropical dry forests, high Andean 
cloud forests, lowland savannah and other 
types — classified almost half (44%) as either 
‘endangered’ or ‘critically endangered’, as 
defined by the Red List protocol’. 

This results from human activities such 
as forest clearance for illegal coca crops, 
cattle ranching and mining. The finding has 
prompted the Colombian government to 
focus on the amount of land given protected 
area status, and to consider the restoration of 
critically endangered ecosystems. 

In South Africa and Australia, businesses 
wanting to encroach on ecosystems that are 
classed as critically endangered or endan- 
gered must first conduct a full environmental 
impact assessment for their proposed project. 
Likewise, Finland’s first government-led sys- 
tematic ecosystem assessment, completed 
in 2008, resulted in increased protection of 
threatened forest under the nation’s Environ- 
ment Protection Act and Forest Act”. 

In China, assessments of the rapid decline
of tidal-flat ecosystems have catalysed efforts
to better understand, manage and protect 
them. Tidal flats surrounding the Yellow Sea 
in east Asia support the migration of up to 
three million shorebirds and stabilize the 
coastline for more than 150 million people, 
also providing them with storm protection 
and food"®. In July 2019, two of these impor- 
tant migratory sites were added to the United 
Nations Educational, Scientific and Cultural 
Organization (UNESCO) World Heritage List 
after being classified as endangered under 
the IUCN criteria. 


Action and accountability 


It is difficult to accurately assess progress 
towards conservation targets at the species 
level — a major constraint on their effective- 
ness. Monitoring of at-risk species is often 
infrequent and numbers fluctuate naturally 
from year to year. Such species also tend to 
be elusive. At the ecosystem level, a SMART
target should therefore enable frequent 
tracking of ecosystems using remote sens- 
ing and modelling. This could result in 
more-transparent reporting of the status 
of Earth’s ecosystems, enhancing public 
awareness of their current trajectories and 
the consequences of their decline. 

Any ecosystem target should set limits on 
degradation that mark the irreversible loss of 
key processes“. A target should also highlight 
the importance of conserving healthy ecosys- 
tems over restoring degraded ones. Such res- 
toration is technologically and economically 
challenging and, as yet, there is no evidence 
that complete restoration of an ecosystem is 
even possible. Nevertheless, restoration has 
a key role in avoiding species extinctions and 
mitigating climate change, and should be part 
of an ecosystem goal.

Human impacts such as overfishing have affected the Amazon River ecosystem in Brazil.


The Rome meeting is the second of three 
working-group meetings for negotiations 
leading up to a new set of biodiversity targets,
which will replace those agreed in 2010. This 
2030 global strategic plan for biodiversity 
will be formally established in October by the 
signatories to the CBD. 

This year marks the implementation of the 
pledges made in the Paris climate agreement, 
and the United Nations Decade on Ecosystem 
Restoration begins in 2021. The launch of the 
2030 strategic plan in October is an unprece- 
dented opportunity — perhaps the last — for 
humanity to address multiple environmental 
problems at once. Whereas a species target 
forces nations to report on their progress only 
in relation to biodiversity, an ecosystem target
would necessitate simultaneous reporting on 
wins across three fronts: biodiversity, climate 
change and sustainability (specifically, on 
the United Nations Sustainable Development 
Goals for human development and well-being). 

World leaders must be held accountable for 
the current and future state of their countries’ 
ecosystems.


Readers respond


Correspondence 


Global solutions to 
prevent a pandemic 


Investment in research must be 
fast-tracked if we are to tackle 
the new coronavirus disease, 
COVID-19. We need greater 
insight into the transmission, 
progression and epidemiology 
of this respiratory illness. 

We need to know the risk 
factors for infection, the role 
of asymptomatic or mild 
infection and the nature of 
‘super-spreaders’. We must 
determine disease seasonality 
and the viability of the virus in 
hot, humid environments, and 
improve estimates of death rates 
by age. 

Research relevant to countries 
with weaker surveillance, lab 
facilities and health systems 
should be prioritized. In those 
regions, vaccine supply routes 
should not rely on refrigeration, 
and diagnostics should be 
available at the point of care. 
The World Health Organization 
is mapping such research and 
development priorities. 

Social-science issues are 
important, too. These include 
how to communicate to the 
public what the options are 
for managing and preventing 
the disease, and how to tackle 
misconceptions and fear 
and avoid stigmatization. 
Community engagement 
and responsibility must be 
encouraged. 


Charlotte H. Watts Government 
Department for International 
Development, London, UK. 


Patrick Vallance Government 
Office for Science, London, UK. 


Christopher J. M. Whitty 
Government Department of 
Health and Social Care, London, 
UK. 

chris.whitty@dhsc.gov.uk 


Careless virus names 
stoke sinophobia 


The coronavirus that is currently 
causing severe respiratory 
illness worldwide has now 
been named SARS-CoV-2, and 
the disease is COVID-19. When 
the virus first emerged last 
December, it was generally 
described in medical journals 
as the ‘2019 novel coronavirus’. 
Nature, however, used ‘China 
coronavirus’ and ‘Wuhan 
coronavirus’. Such interim
terminology based on
geographic characteristics
is objectionable because it
can stimulate prejudice and
discrimination against Chinese 
people, fuelled internationally 
by fear spread through social 
media. 

Although it is difficult and 
time-consuming to formally 
name diseases and viruses, it is 
essential that we methodically 
select no-harm names for them 
to make their way into human 
history. In 2015, the World 
Health Organization issued 
guidelines intended to minimize 
“unnecessary negative impact of 
disease names on trade, travel, 
tourism or animal welfare, 
and avoid causing offence to 
any cultural, social, national, 
regional, professional or ethnic 
groups”. It asks scientists, 
journalists and health officials 
to use neutral, generic terms 
when referring to new human 
infectious diseases. 


Lele Shu University of California, 
Davis, California, USA. 
lele.shu@gmail.com 


Editor's note: Nature has 
stopped referring to the 2019 
novel coronavirus (SARS-CoV-2) 
as the Wuhan or China virus,
for the reasons cited in the
Correspondence. The names that 
appeared in earlier headlines 
were used to reflect the situation 
as it was understood at the time. 


No special code for
disaster research 


As directors of the University
of Delaware’s Disaster Research
Center, we disagree with the call
by J. C. Gaillard and Lori Peek for
a code of conduct for disaster-zone
research (Nature 575,
440-442; 2019). 

In our view, such a customized
code would be likely to create
a compliance morass out of all
proportion to any ostensible
harm. For example, the authors
apply too broad a brush in
referring to ‘communities’ and 
‘local priorities’. Communities 
are characterized by politics, 
power differences and 
stakeholders clamouring for 
attention. The authors suggest 
that research should align with 
community priorities. But rarely 
is there a single local priority, 
so whose priorities should take 
precedence, and why? Those 
priorities might even recreate 
the conditions that led to the 
disaster, or further marginalize 
other voices. 

A disaster zone is not easy to 
define. The whole of Japan was 
affected by the 2011 Tohoku 
earthquake and tsunami, for 
example — even areas that were 
not physically hit. And, contrary 
to the authors’ implication, 
there is no evidence that ethical 
concerns in post-disaster 
research are more severe than in 
other research involving human 
participants. 

Such research can be done 
badly if, for example, the 
researcher has not properly 
reviewed the vast literature on 
quick-response best practice. 
Imposing criteria set by the 
United Nations would not 
prevent that. Dissemination and 
refinement of best practices 
remain the most crucial goals. 


James Kendra, Tricia Wachtendorf University of
Delaware, Newark, Delaware, USA.
jmkendra@udel.edu


Authorship: include 
citizen scientists 


In our view, protocols for 
academic authorship need to 
adapt to acknowledge those 
members of the public who 
are increasingly engaging 
in important collaborations
with researchers. These 
citizen scientists, who might 
include naturalists, farmers 
or Indigenous communities, 
rarely meet rigid journal- 
imposed criteria for 
authorship (see, for example, 
go.nature.com/2urkbrp). 
Consequently, protocols 
designed to stamp out ethical 
breaches, such as ghost
authorship and conflicts of 
interest, exclude contributors 
who are not professional 
scientists. 

Providing due credit is a core
tenet of scientific ethics, and 
citizen scientists are pivotal to 
research projects and to the 
resulting publications. 

Creating group 
co-authorships for cohorts 
of citizen scientists would 
credit them under a collective 
identity (see, for example,
G. Ward-Fear et al. Trends Ecol.
Evol. http://doi.org/ggd6v7; 
2019). Furthermore, citizen 
scientists can play a crucial 
part in the uptake of scientific 
understanding by the general 
public. 


Georgia Ward-Fear* Macquarie 
University, Sydney, Australia. 
georgia.ward-fear@mq.edu.au 
*On behalf of 4 correspondents 
(see go.nature.com/37kr9q5). 


HOW TO SUBMIT 


Correspondence may be
submitted to correspondence@nature.com
after consulting the
author guidelines and section
policies at go.nature.com/cmchno.


Expert insight into current research


News & views 


Epigenetics 


How to silence
an X chromosome


Jackson B. Trotman & J. Mauro Calabrese 


The non-coding RNA Xist has been shown to enlist the SPEN 
protein to recruit a team of protein complexes — initiating 
the process that prevents transcription of one of the two
X chromosomes found in female mammalian cells. See p.455


Female mammals have two X chromosomes, 
whereas males have only one. A remarkable 
solution has therefore evolved to prevent a 
gross imbalance in gene expression occur- 
ring between the sexes: in every cell that has 
two X chromosomes, one entire X chromo- 
some is ‘silenced’ to prevent RNA from being 
transcribed from it. This process is called 
X-chromosome inactivation (XCI) and initiates 
early in the development of female embryos.
Once complete, XCI is essentially stable for
life’; thus, by extension, a human X chromosome
can be propagated in the silenced state
for more than 100 years.

XCI has become a paradigm for epigenetic
processes — those in which DNA and asso- 
ciated proteins are modified to alter gene 
expression — and has been intensively stud- 
ied for decades. For the past 25 years, much of 
this research has centred on a long non-coding
RNA (lncRNA) called Xist, which is needed to
orchestrate XCI. However, the details of Xist’s 
silencing mechanism have been elusive. Dossin 
et al.’ report a stunning series of experiments
on page 455 that reveal how Xist silences genes 
by partnering with a protein called SPEN. 

Xist is expressed exclusively from the 
X chromosome that will be inactivated, where 
it spreads locally and silences nearly every 
gene on the chromosome by associating with
an array of proteins. For example, Xist engages
the Polycomb protein complexes (which mod- 
ify the histone proteins that package DNA 
into a condensed form called chromatin) to 
maintain gene silencing on the inactivated 
X chromosome”. Although this maintenance 
function is well documented, how Xist silences 
active genes in the first place has remained a 
mystery — in part because the majority of Xist’s 
protein partners were unknown. But in 2015, a
series of studies* ’ revealed a comprehensive
set of proteins involved in XCI. These screens
all identified SPEN as a Xist-binding protein 
that is essential for XCI. 

SPEN belongs to an evolutionarily con- 
served family of RNA-binding proteins that 
have been implicated in transcriptional silenc- 
ing and, curiously, RNA processing in both 
animals and plants”. To interrogate SPEN’s 
role in XCI, Dossin et al. first used a biological
system known as an auxin-inducible degron 
to rapidly degrade SPEN in mouse embryonic 
stem cells. Consistent with a 2019 report”, 
the authors observed that Xist is almost 
completely unable to silence genes along the 
X chromosome in the absence of SPEN. In an 
important first, the authors demonstrated 
that SPEN is required for successful XCI in vivo
in mice. They also found that SPEN was needed
to dampen expression of ‘escapees’: genes
on the silenced X chromosome that partially
evade XCI.

By observing fluorescently labelled 
molecules in living cells, Dossin et al. found 
that SPEN is recruited to the X chromosome as
soon as Xist expression begins at the onset of 
XCI. SPEN contains four RNA-binding domains 
(called RRMs) at its amino-terminal end and 
an evolutionarily conserved SPOC domain 
at its carboxy-terminal end. The authors 
found that, although RRMs 2-4 are required 
to bind Xist, the SPOC domain is the essential 
mediator of gene silencing. As suggested by 
previously reported experiments”, forcing an 
interaction between Xist and the SPOC domain 
alone was enough to restore XCI in cells that
lack SPEN. 

It has been proposed” that SPEN confers 
gene-silencing capabilities on Xist by recruit- 
ing and/or locally activating the enzyme 
HDAC3, which removes gene-activating 
acetyl groups from histones. However, HDAC3 
accounts for only part of the gene silencing 
that occurs during the early stages of XCI°. To 
find other mechanisms by which SPEN might 
bring about silencing, Dossin et al. used a mass
spectrometry technique to identify proteins 
that interact with the SPOC domain. 

Figure 1 | Mechanism of gene silencing by SPEN. The long non-coding RNA Xist and its protein cofactor,
SPEN, suppress (silence) gene expression in one of the two X chromosomes found in female mammalian cells.
This is an essential process that prevents a gross imbalance in gene expression between males and females.
Dossin and colleagues’ experiments suggest that SPEN initiates this silencing mechanism by binding to
active gene promoters (DNA sequences that initiate transcription) and enhancers (sequences that increase
the likelihood of transcription). SPEN recognizes active promoters in part by interacting with constituents
of the machinery used for gene transcription, including RNA polymerase II (Pol II, the enzyme that catalyses
transcription). SPEN also recruits and/or locally activates the gene-inactivating protein HDAC3, and gene-
silencing protein complexes such as the nucleosome remodelling and deacetylase (NuRD) complex. Once a
gene has been silenced, SPEN disengages from its binding site, possibly displacing Pol II in the process.

Confirming earlier work", the authors
found that SPEN’s SPOC domain interacts not
only with HDAC3, but also with the associated
co-repressor proteins NCOR1 and NCOR2
(also called SMRT), and with components of
the nucleosome remodelling and deacetylase
(NuRD) complex, all of which are epigenetic
silencers. Moreover, the authors observed
that the SPOC domain interacts with parts 
of the machinery used for transcription and 
splicing (the process by which newly made RNA 
transcripts are turned into messenger RNA), 
including RNA polymerase II, the enzyme that
catalyses transcription. Dossin and colleagues 
identified interactions with components of 
the N6-methyladenosine (m6A) methyltransferase
complex, several of which have been
linked to XCI°"". Accordingly, SPEN and its 
array of associated proteins might function 
like a molecular multi-tool to silence genes 
in various genomic contexts. Although much 
of SPEN’s silencing function might derive 
from its interactions with known epigenetic 
silencers, its association with transcription 
and RNA-processing machineries leaves open 
the possibility that SPEN can also silence 
genes through another, as-yet-undefined 
mechanism. 

Perhaps most strikingly, Dossin et al. 
adapted a technique called CUT&RUN to map 
the location of SPEN on an X chromosome 
that was being inactivated. This revealed 
that, shortly after Xist starts to be expressed, 
SPEN associates with active gene promoters 
and enhancers (DNA regions that initiate 
and increase the likelihood of transcription, 
respectively), but then disengages from these 
sites after it has silenced transcription. These 
discoveries imply that SPEN is part of a system
that recruits silencing machinery specifically
to transcriptionally active regulatory elements
at the onset of XCI (Fig. 1). Whether this mech- 
anism also requires chromatin modifications, 
RNA polymerase II, actively transcribed RNA
or other factors should be addressed in the 
future. Another issue that should be investi- 
gated is why Xist isn’t silenced by SPEN, given 
that a large amount of SPEN accumulates over
the Xist gene. 

SPEN binds to a region of Xist RNA called 
Repeat A, which is required to initiate gene 
silencing®**. Because deleting the Spen 
gene largely mirrors the effects of deleting 
Repeat A (ref. 11), SPEN seems to be respon- 
sible for most of Repeat A’s silencing ability. 
However, Repeat A also binds to other pro- 
teins, including those that normally promote 
splicing, as well as to RBM15 and RBM15B,
SPEN’s SPOC-domain-containing cousins*>”. 
Therefore, it is now crucial to determine how 
these proteins might compete or cooperate 
with SPEN to initiate gene silencing. Moreover, 
deletion of Repeat A drastically reduces levels 
of the Xist RNA itself", and, in certain contexts, 
deletion of SPEN similarly reduces levels of 
Xist". How Repeat A is required for the production
of Xist, and how its role in Xist production
relates to its ability to initiate silencing, are key 
questions for the future. 

For decades, Xist has served as a leading 
example of RNA’s role in regulating gene 
expression. Most notably, Xist was one of the 
first mammalian RNAs shown to be involved in
Polycomb-mediated silencing’. It therefore 
seems appropriate that, by studying this RNA, 
Dossin et al. might have uncovered a new and
fundamental aspect of gene regulation — the 
transient recruitment of SPEN to regulatory 
elements by RNAs, or even by proteins, which 
could be a general mechanism for silencing 
transcription throughout the mammalian 
genome.


Why surface roughness is
similar at different scales 


Astrid S. de Wijn 


Most surfaces are rough at many length scales. Simulations 
show that this characteristic originates at the atomic level in 
metal-based materials when smooth blocks of these materials
are compressed.


Almost all solid surfaces are rough. This 
roughness occurs at length scales that 
encompass 13 orders of magnitude — from 
the kilometre-scale peaks of mountains, 
down to atomic-scale bumps. Roughness 
seems to emerge regardless of what is done 
to a surface. Yet there is little understand- 
ing of how this roughness comes about, and 
especially why it is often self-affine, meaning 
that a surface looks similar on different length
scales. Writing in Science Advances, Hinkle 
et al.‘ show that self-affine roughness has its 
origin at the atomic level. 

As anyone who has ever slipped on a wet floor
will have noticed, the roughness of surfaces 
can have a crucial role in practical situations. 
Smooth surfaces are slippery when wet, but are
also easier to lubricate inside moving machin- 
ery than are rough surfaces. By contrast, we 
sand surfaces before painting them to make 
them rougher, and thereby to increase the 
adhesion of the paint. The effects of roughness 
are less straightforward in other situations: for 
example, the roughness of the surfaces of skis 
and snowboards affects their friction on snow 
differently depending on the temperature and
humidity”. Engineers have therefore developed 
many techniques to control surface roughness, 
such as grinding, polishing and so on. Hinkle 
and colleagues’ results help us to understand 
better how roughness emerges, and thus might
provide new ideas for how to control it.

The authors carried out computational 
simulations of three materials: a single, per- 
fect gold crystal, an alloy and a metallic glass. 
These materials have very different amounts 
and types of disorder, which means that rough- 
ness might be expected to develop through 
different mechanisms or to have different 
characteristics for each of them. Because the 
deformation of a material is likely to contribute
to the formation of roughness, the research- 
ers simulated the compression of flat blocks 
of these materials beyond their elastic limit — 
that is, at forces that cause irreversible (plas- 
tic) deformation. Because the length scales of 
the effects the researchers were looking for 
span several orders of magnitude, the simu- 
lations had to be quite large, containing tens 
of millions of atoms. Such simulations are 
computationally extremely expensive. 

Hinkle and colleagues investigated how 
fluctuations in the roughness produced 
in the simulations change when the size of 
the area being observed is increased. They 
observed that the roughness profiles of all 
three materials seem to obey a power law 
— that is, they do indeed display self-affine 
scaling, over nearly two orders of magnitude 
(from about 1 nanometre up to the size of their
simulation ‘box’, which was approximately
70-100 nm; Fig. 1).

Figure 1 | Roughness on a simulated gold surface. Hinkle et al.' carried out molecular-dynamics simulations
of tens of millions of atoms in smooth blocks of three materials, including gold (shown here), and observed
how surface roughness develops when the blocks are compressed. Colours represent atomic positions
perpendicular to the surface, measured relative to the surface’s mean height: red indicates high topography;
blue, low. The highest features are 8.8 nanometres above the lowest point on the surface. The authors found
that roughness emerges that is similar across nearly two orders of magnitude of length scales. Similar
triangular features and variation of topography are visible in a (a region 80 nm across) and b (a region of a
expanded to four times its original size). The same is also true at magnifications of 8 and 64 (not shown).
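
To make the idea of self-affine, power-law scaling concrete, the following is a minimal sketch and not the analysis used by Hinkle et al. It assumes a one-dimensional random-walk height profile as a stand-in for a simulated surface (such a profile is self-affine with a scaling, or Hurst, exponent of 0.5) and estimates the exponent from the slope of log(RMS height difference) against log(lag). All function names are illustrative.

```python
import math
import random


def random_walk_profile(n_points, seed=0):
    """Synthetic height profile built from independent Gaussian steps.

    Assumption for illustration only: a random walk is a simple
    self-affine profile whose expected Hurst exponent is 0.5.
    """
    rng = random.Random(seed)
    heights = [0.0]
    for _ in range(n_points - 1):
        heights.append(heights[-1] + rng.gauss(0.0, 1.0))
    return heights


def rms_height_difference(heights, lag):
    """Root-mean-square height difference between points `lag` apart."""
    n = len(heights) - lag
    total = sum((heights[i + lag] - heights[i]) ** 2 for i in range(n))
    return math.sqrt(total / n)


def estimate_hurst_exponent(heights, lags):
    """Least-squares slope of log(RMS difference) versus log(lag).

    For a self-affine profile the RMS difference grows as lag**H,
    so the slope on log-log axes estimates the exponent H.
    """
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(rms_height_difference(heights, lag)) for lag in lags]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


profile = random_walk_profile(50_000, seed=1)
lags = [2 ** k for k in range(8)]  # lags 1, 2, 4, ..., 128
hurst = estimate_hurst_exponent(profile, lags)
print(f"Estimated scaling exponent: {hurst:.2f}")
```

The lags used here span only about two orders of magnitude, which mirrors the range reported in the article. A straight line on log-log axes over such a short range is suggestive of a power law but not conclusive, which is why a wider range of length scales matters.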

In addition to simulating millions of atoms, 
the authors simulated a continuum model 
of compressive deformation in which the 
material is not treated as being composed 
of individual atoms, but as a continuous 
medium. In these simulations, there is no sign
of self-affine roughness. The authors therefore 
conclude that the development of self-affine 
roughness is related to atomic-scale fluctua- 
tions in plastic flow that are missing from the 
continuum model. 

Hinkle and colleagues’ results are convinc- 
ing across the observed length scales, but the 
scaling behaviour of the roughness will need to 
be demonstrated across three orders of magni- 
tude to confirm that it truly obeys a power law. 
This will require the atomic simulations to be 
extended to even larger scales. Modelling tech- 
niques (see ref. 3, for example) are available at 
mesoscale lengths (which range from a few 
nanometres to several hundred micrometres), 
and provide a link between atomistic and con- 
tinuum simulations. These approaches take 
flow into account in more detail than does the 
continuum model used by Hinkle et al., and 
would allow for increased atomistic detail and 
fluctuations in simulations. This could help to 
provide the extra order of magnitude needed 
to convincingly show the power-law statistics 
of the roughness. 

It remains to be seen how universal the 
reported behaviour is. All of the materials 
investigated by Hinkle and co-workers are 
based on metals. After undergoing plastic 
deformation, they are all homogeneous (there 
is only one type of solid phase in the material) 
but disordered, and the dynamics and energy
scales involved in atom displacement are all
comparable. It would be interesting to see 
whether similar scaling behaviour emerges 
from the compression of other types of 
material that have different mechanisms of 
plasticity and deformation, such as polymers.
If so, are the scaling exponents (the key scaling
parameters in the power-law equation)
the same for all materials? If roughness profiles
can be extended to include one or more
extra orders of magnitude, it would enable a
reliable comparison of the scaling exponents.
This, in turn, would help to determine whether
these exponents vary with strain, deformation
mechanisms, or even time.

Power-law behaviour is common in plastic
deformation. For example, ‘avalanches’ of
plastic deformation occur in metals’, and in
fibrous materials a power law describes the size
distribution of avalanches when these materials
deform plastically under tensile stress°.
Given that Hinkle et al. simulate the formation
of rough surfaces in response to plastic
deformation, and also observe scale-free
roughness in the bulk of the modelled materials,
it seems likely that there is a link between
the development of self-affine roughness and
the power-law behaviour of plastic deformation
events, as the authors also note. It would
therefore now be interesting to study the emergence
of roughness in a more dynamic way,
by investigating the formation of roughness
features during compression, and relating the
changes in the surface profile to plastic events.


Cancer


Loss of p53 protein strikes
a nerve for tumour growth


Marco Napoli & Elsa R. Flores 


Tumours often grow entangled among neurons, which makes 
the cancer difficult to treat. The finding that cancer cells hijack 
neighbouring neurons to promote tumour growth suggests
new therapeutic targets. See p.449


Malignant tumours are a complex, yet 
organized, diverse ensemble of cells. 
Tumour cells are surrounded by other types 
of cell, which collectively form the tumour 
microenvironment. Components of this microenvironment
include fibroblast cells, which can
promote the growth and spread of tumours to 
distant sites, and immune cells. The latter have 
antitumour functions that are often suppressed 
by cancer cells; indeed, therapies that boost
such immune cells are revolutionizing the 
treatment of certain cancers. By contrast, the 
interactions between cancer cells and neurons 
in the tumour microenvironment are less well
understood. On page 449, Amit et al.’ reveal
how tumours influence neurons to promote 
tumour growth, and show how this discovery 
might lead to new anticancer therapies.

Figure 1 | Tumours manipulate neighbouring neurons to boost cancer growth. Amit et al. analysed
head and neck cancers using clinical data and mouse models. a, Tumour cells that expressed wild-type p53
protein released vesicles containing small RNA molecules called microRNAs (miRNAs) that were taken up
by neighbouring neurons. An miRNA known as miR-34a blocks neuronal proliferation, and the neurons were
maintained in their current state. By contrast, tumours that had a mutant version of p53 released vesicles
that lacked miR-34a. In this case, neurons increased in number in the vicinity of the tumour, and these cells
were reprogrammed as adrenergic neurons that express the molecule noradrenaline. These neurons had
more axonal branches than did those near tumours that expressed wild-type p53. Interactions between
adrenergic neurons and the tumour aided cancer growth. b, When mice received a transplant of p53-
deficient tumour cells, treatment with a drug (carvedilol) that blocks adrenergic signalling pathways slowed
tumour growth. This might provide a new therapeutic tool for targeting tumours that need neighbouring
adrenergic neurons for their growth.


The interplay between cancer and neurons 
has negative clinical consequences for people 
with prostate tumours’. Individuals who have 
a higher number of new neurons (in structures 
called nerve fibres) in the tumour micro- 
environment tend to have more-aggressive 
tumour features, such as further tumour 
growth and migration to distant sites, and a
decrease in survival time’. 

Studies last year found that cancer cells 
and neurons can interact directly with each 
other through connections called synapses, 
that these connections aid the growth of 
brain tumours called gliomas, and that
this interaction is associated with lethal
cancer spread’. These and other findings
contribute to a growing body of evidence 
that neurons are crucial components of the 
tumour microenvironment. However, what 
prompts the formation of neurons in the 
microenvironment had not been understood 
until now. 

Amit and colleagues took on this challenge 
by focusing on tumours known as head and 
neck cancers, which can arise in the oral cavity. 
In humans, these tumours are often charac- 
terized by mutations that inactivate the gene 
TP53. This gene encodes a protein (p53) that 
functions as a tumour suppressor and that can
modulate the tumour microenvironment’. By 
analysing four different mouse models of this 
disease and data obtained from biopsies of
people with head and neck cancer, the authors 
found that tumours with mutant versions of 
p53 have a higher number of associated newly 
formed neurons than do those with wild-type 
p53. Moreover, an increased number of such 
neurons correlated with a shorter survival 
time. 

To try to determine whether cancer cells 
with mutant p53 might stimulate neurons to 
form, Amit et al. analysed the factors released
by human cancer cells that have mutant or
wild-type p53. Both types of cell secreted vesi- 
cles that contained small RNA molecules called 
microRNAs (miRNAs). The vesicles in the 
two cell types were of a comparable number 
and size, but their contents differed (Fig. 1). 
Only the vesicles secreted from tumours 
with mutant p53 were devoid of an miRNA 
termed miR-34a, which is a tumour suppres- 
sor. When vesicles from tumours lacking p53 
were injected into mice with tumours that had 
wild-type p53, the tumours with wild-type 
p53 grew larger and had more surrounding
neurons than normal, indicating that the con- 
tents of these vesicles drive the formation of 
new neurons. This is the first report showing 
that miR-34a, the main function of which is
to keep in check the proliferation of normal 
and cancer cells’, is important in counter- 
acting the formation of neurons in the tumour 
microenvironment. 

Amit and colleagues analysed how these 
newly formed neurons promote tumour 
growth. The authors examined the neurons 
present in tumours with mutant and wild- 
type p53. Intriguingly, in the former set, the 
neurons (presumably including those already 
present in the area where the tumour formed)
had undergone a functional change to become
a type of neuron known as an adrenergic neuron,
which uses the adrenergic signalling
pathway and is activated in the ‘fight-or-flight’
response. This adrenergic feature (which has 
hallmarks including expression of the molec- 
ule noradrenaline) was crucial for sustaining 
cancer growth. 

Interestingly, previous epidemiological 
analysis’ revealed that the use of the drug 
carvedilol, which blocks adrenergic signalling 
and is prescribed for conditions such as high 
blood pressure, is associated with a reduced 
risk of cancer onset. Now, Amit et al. raise the
question of whether carvedilol’s anticancer 
properties might be due to its ability to target 
adrenergic neurons, given the effectiveness of 
the drug in treating mice with p53-deficient 
tumours (Fig. 1). The authors’ findings are 
of particular interest because these insights 
might offer a way to combat the tumour- 
driven formation of adrenergic neurons and 
to counteract their tumour-promoting effects. 
It will be important to establish whether adrenergic
neurons’ contribution to tumour growth
is limited to just head and neck cancers that
have mutant p53, or whether this phenomenon
could also be a feature of other types of
tumour, as suggested by the epidemiological 
evidence for carvedilol use’. 

Mutant versions of the gene encoding p53 
are among the most common alterations in 
certain human cancers, occurring in approx- 
imately 60% of colon cancers, 50-80% of lung 
cancers and 95% of ovarian tumours”. Given 
the high prevalence of p53 abnormalities in 
cancer, numerous efforts have been made to 
design compounds that target mutant p53 
to force it to act like wild-type p53, and prom- 
ising results have been obtained in early-phase 
clinical trials of such drugs". It would be worth 
testing whether using both carvedilol and a 
drug that targets mutant p53 is more effective 
than either compound alone in treating these
lethal forms of cancer. 

Amit and colleagues’ discovery that the 
absence of functional p53 influences the for- 
mation of neighbouring neurons might have 
relevance for interpreting reports showing
that fluctuations in the levels of wild-type p53
are observed in nerve regeneration”. Thus, the 
authors’ findings might have repercussions 
that reach beyond the field of cancer research 
to regenerative medicine. Perhaps therapies 
that modulate the activity of p53 will have a 
future role in aiding the repair or regenera- 
tion of neurons, an outcome that would make 
a profound difference to the lives of people 
who have neurodegenerative diseases or other 
types of nerve injury. 

Fundamental symmetry
tested using antihydrogen 


Randolf Pohl 


The breaking of a property of nature called charge-parity-time 
symmetry might explain the observed lack of antimatter in 
the Universe. Scientists have now hunted for such symmetry 
breaking using the antimatter atom antihydrogen. See p.375 


One of the greatest mysteries in modern 
physics is why the Universe seems to contain 
mostly matter and almost no antimatter. This 
observation could be explained if a property 
of nature called charge-parity-time (CPT) 
symmetry is violated. Under CPT symmetry, 
the physics of particles and their antiparticles 
is identical. A tiny violation of CPT symmetry 
during the Big Bang could, in principle, be 
responsible for the lack of antiparticles in 
the Universe. On page 375, the ALPHA Collab- 
oration’ reports high-precision spectroscopic 
measurements of antihydrogen — an atom 
comprising an antiproton and a positron (the 
antiparticle of an electron). The authors find 
that the gaps between energy levels in anti- 
hydrogen are in excellent agreement with 
those measured previously in ordinary hydrogen,
placing strong constraints on potential
CPT violation. 

Tests of CPT symmetry using individual 
particles, such as neutral kaons, positrons
and antiprotons, have shown no sign of CPT
violation. However, studies of antihydrogen 
might probe the influence of factors that were 
not explored in previous tests. 

Hydrogen is the simplest atom, and its 
properties can be calculated with impressive 
precision. For more than a century, the study 
of this atom has been the driving force behind 
groundbreaking ideas about the structure of 
matter. The optical spectrum of hydrogen was 
measured with great accuracy in the 1880s,
before being quantitatively explained in the
1910s. The structure of the atom was then
at the heart of the formulation of quantum
mechanics and in the generalization of this
theory to relativistic (fast-moving) particles
in the 1920s. And it was the unexpected discovery
of an energy gap between the 2S and
2P₁/₂ excited states of hydrogen by the physicist
Willis Lamb in 1947 that motivated the development
of quantum electrodynamics, the
theory that describes the interactions between
particles and light.

Figure 1 | Lowest-energy states of antihydrogen.
The ALPHA Collaboration carried out high-precision
spectroscopic measurements of
antihydrogen, the antimatter counterpart of
hydrogen. Specifically, the team determined the
energy differences between the 1S ground state and
the 2P₁/₂ and 2P₃/₂ excited states of antihydrogen.
They used these results to estimate the fine-structure
splitting (the 2P₁/₂–2P₃/₂ energy gap). They
also combined their previous determination of
the energy gap between the 1S and 2S states with
their current measurement of the 1S–2P₁/₂ energy
difference to infer the Lamb shift (the 2S–2P₁/₂
energy gap). The authors found that all of these
results are in agreement with the corresponding
ones for ordinary hydrogen. (Drawing not to scale.)

This energy gap, known as the Lamb shift, 
exists in both hydrogen and antihydrogen. 
It originates mostly from quantum fluctua- 
tions, whereby particle-antiparticle pairs 
spontaneously emerge in empty space and 
then instantly annihilate each other. How- 
ever, its magnitude is subtly affected by, for 
example, the charge radius (the spatial extent 
of the charge distribution) of the proton or 
antiproton, the weak nuclear force and, poten- 
tially, currently unknown phenomena that 
could be the source of the matter—antimatter 
asymmetry in the Universe. 

The current work was carried out using 
the ALPHA experiment at CERN, Europe’s 
particle-physics laboratory near Geneva, 
Switzerland. A facility called the Antiproton 
Decelerator delivers antiprotons to this 
experiment, with a source of radioactive 
sodium providing positrons. Every few min- 
utes, 90,000 cold trapped antiprotons and 
3 million positrons are mixed in a sophisti- 
cated charged-particle trap. This process 
yields about 20 cold antihydrogen atoms 
that are then confined to a neutral-atom trap 
made from superconducting magnets. These 
antihydrogen atoms can be stored” for at 
least 60 hours, and production cycles can be 
repeated to obtain hundreds of such atoms. 
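
The production figures quoted above can be turned into a rough yield estimate. The following is a minimal sketch of that arithmetic only; the target of 500 atoms and the 4-minute cycle time are hypothetical values chosen for illustration (the text says only "every few minutes" and "hundreds of such atoms").

```python
import math

# Figures quoted in the text for a single mixing cycle.
ANTIPROTONS_PER_CYCLE = 90_000
POSITRONS_PER_CYCLE = 3_000_000
TRAPPED_ATOMS_PER_CYCLE = 20


def yield_per_antiproton() -> float:
    """Fraction of antiprotons that end up as trapped antihydrogen atoms."""
    return TRAPPED_ATOMS_PER_CYCLE / ANTIPROTONS_PER_CYCLE


def cycles_needed(target_atoms: int) -> int:
    """Number of whole production cycles needed to accumulate target_atoms."""
    return math.ceil(target_atoms / TRAPPED_ATOMS_PER_CYCLE)


# Hypothetical values, not taken from the article.
TARGET_ATOMS = 500
MINUTES_PER_CYCLE = 4.0

cycles = cycles_needed(TARGET_ATOMS)
hours = cycles * MINUTES_PER_CYCLE / 60.0

print(f"yield per antiproton: {yield_per_antiproton():.2e}")
print(f"cycles for {TARGET_ATOMS} atoms: {cycles} (about {hours:.1f} h)")
```

Under these assumptions, only about two in every 10,000 antiprotons end up as trapped antihydrogen, which is why long storage times and repeated cycles matter.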

The aim of the present study was to 
measure the energy differences between the 
1S ground state and the 2P₁/₂ and 2P₃/₂ excited
states of antihydrogen (Fig. 1). The ALPHA 
Collaboration used an approach called laser 
spectroscopy, which involved injecting pulses 
of laser light into the antihydrogen trap. This 
injection caused atoms to transition from the
1S state to the 2P₁/₂ or 2P₃/₂ state and to subsequently
decay back to the 1S state. Atoms that
ended up in a different magnetic substate of
the 1S state from the one in which they started 
were expelled from the magnetic neutral-atom 
trap. These antihydrogen atoms then annihi- 
lated on contact with regular atoms in the walls 
of the ALPHA apparatus to produce particles 
called charged pions. 

The ALPHA Collaboration plotted the 
number of observed charged pions as a 
function of the frequency of the laser light. 
They then used the positions of the two 
peaks in these plots to infer the 1S–2P₁/₂ and
1S–2P₃/₂ energy differences in antihydrogen.
These differences agree with the ones
measured in ordinary hydrogen at the level
of 16 parts per billion. The authors used their
results to estimate the fine-structure splitting
(the 2P₁/₂–2P₃/₂ energy difference) in antihydrogen,
with an uncertainty of 0.5%. This
value is again in good agreement with the one 
for ordinary hydrogen. 
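
The idea of reading transition frequencies off the peaks of a count-versus-frequency plot can be illustrated with a toy example. This is a minimal sketch on synthetic, noise-free data, not the collaboration's analysis, which relies on detailed lineshape modelling. The peak positions, widths, heights and background below are hypothetical numbers in arbitrary detuning units.

```python
import math


def gaussian(x: float, centre: float, width: float, height: float) -> float:
    """Gaussian line profile evaluated at x."""
    return height * math.exp(-0.5 * ((x - centre) / width) ** 2)


def synthetic_counts(detunings: list[float]) -> list[float]:
    """Toy spectrum: two Gaussian lines on a flat background (hypothetical values)."""
    return [
        gaussian(d, 0.0, 0.8, 120.0) + gaussian(d, 14.0, 0.8, 90.0) + 2.0
        for d in detunings
    ]


def find_peaks(xs: list[float], ys: list[float], threshold: float) -> list[float]:
    """Return the x positions of local maxima whose height exceeds threshold."""
    peaks = []
    for i in range(1, len(ys) - 1):
        if ys[i] > threshold and ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]:
            peaks.append(xs[i])
    return peaks


# Laser detuning grid from -5 to 20 in steps of 0.25 (arbitrary units).
detunings = [i * 0.25 - 5.0 for i in range(101)]
counts = synthetic_counts(detunings)

peaks = find_peaks(detunings, counts, threshold=20.0)
splitting = peaks[1] - peaks[0]
print(f"peak positions: {peaks}, separation: {splitting}")
```

With real data, counting noise would make a simple local-maximum search unreliable, so a fit to a model lineshape would be the more robust choice.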

In 2018, the ALPHA Collaboration measured 
the energy gap between the 1S and 2S states in 
antihydrogen to one part in 10¹². In the current
work, the authors combined this result
with their measurement of the 1S–2P₁/₂ energy
difference to provide an estimate of the Lamb 
shift in antihydrogen. This value has an uncer- 
tainty of 11% (or 3.3%, when the fine-structure 
splitting in ordinary hydrogen is used in the 
analysis). 
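
The inference described above is a subtraction of two large optical frequencies, which is why a parts-per-billion measurement yields only a percent-level Lamb shift. The sketch below illustrates this with approximate reference values for ordinary hydrogen (the 1S–2S frequency and the 2S–2P₁/₂ Lamb shift), which are not taken from the article. It treats the 1S–2S uncertainty as negligible, an assumption, and does not attempt to reproduce the 11% figure, which also reflects corrections for the magnetic field.

```python
import math

# Approximate reference values for ordinary hydrogen (assumed, not from the article).
F_1S_2S_HZ = 2_466_061_413_187_035   # 1S-2S transition frequency
LAMB_SHIFT_H_HZ = 1_057_845_000      # 2S(1/2)-2P(1/2) Lamb shift

# Relative precision of the 1S-2P measurement quoted in the text.
REL_UNC_1S_2P = 16e-9

# Synthetic "measured" 1S-2P(1/2) frequency, built from the values above.
f_1s_2p_half = F_1S_2S_HZ - LAMB_SHIFT_H_HZ


def lamb_shift(f_1s_2s: int, f_1s_2p: int) -> int:
    """Lamb shift as the difference of the two optical transition frequencies."""
    return f_1s_2s - f_1s_2p


lamb = lamb_shift(F_1S_2S_HZ, f_1s_2p_half)

# Propagate uncertainties in quadrature; the 1S-2S term is assumed negligible.
sigma_1s_2s = 0.0
sigma_1s_2p = REL_UNC_1S_2P * f_1s_2p_half
sigma_lamb = math.hypot(sigma_1s_2s, sigma_1s_2p)
rel_unc_lamb = sigma_lamb / lamb

print(f"Lamb shift: {lamb / 1e6:.3f} MHz")
print(f"uncertainty: {sigma_lamb / 1e6:.1f} MHz ({rel_unc_lamb:.1%})")
```

Because the Lamb shift is roughly two-millionths of the optical frequency, an absolute uncertainty of tens of megahertz on the 1S–2P₁/₂ line translates into a relative uncertainty of a few per cent on the shift itself.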

Over the past few years, high-precision laser 
spectroscopy of antihydrogen has become 
possible, and the ALPHA Collaboration has 
achieved spectacular progress. An examina- 
tion of several transitions in antihydrogen 
would enable targeted tests of CPT symme- 
try, quantum electrodynamics and the stand- 
ard model of particle physics. For example, 
a measurement of the Lamb shift with an 
uncertainty of less than one part in 10⁴ would
allow the antiproton charge radius to be deter- 
mined”. Moreover, improved measurements 
of the energy gap between magnetic substates 
in antihydrogen would provide detailed infor- 
mation about the magnetic structure of the 
antiproton®. 

The laser used for spectroscopy in the 
current work will, in the future, be used for 
cooling of antihydrogen by inducing 1S–2P₁/₂
and 1S–2P₃/₂ transitions. Such cooling would
greatly improve the achievable precision in all 
spectroscopy experiments on antihydrogen. 
In addition, ultracold antihydrogen can be 
used to study the effect of gravity on these 
atoms”. Cold antihydrogen thus promises 
many cool results. 

30 years of the iron
hypothesis of ice ages 


Heather Stoll 


In 1990, an oceanographer who had never worked on climate 
science proposed that ice-age cooling has been amplified by 
increased concentrations of iron in the sea, and instigated an
explosion of research.


Thirty years ago this month, John Martin 
proposed a solution to one of the biggest 
mysteries of Earth’s climate system: how was 
nearly one-third of the carbon dioxide in the 
atmosphere (about 200 gigatonnes of carbon) 
drawn into the ocean as the planet entered the
most recent ice age, then stored for tens of 
thousands of years, and released again as the 
ice sheets melted? These large natural cycles in
atmospheric CO₂ levels (Fig. 1a) were revealed
in 1987 by an analysis of ancient air bubbles 
trapped in the first long ice cores taken from 
the Antarctic ice sheet’. Martin recognized 
that iron was a key ingredient that could 
have transformed the surface ocean during 
glacial times. His landmark iron hypothesis’, 
published in Paleoceanography, described a 
feedback mechanism linking climatic changes 
to iron supply, ocean fertility and carbon 
storage in the deep ocean. 
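
The figure of about 200 gigatonnes can be checked with a back-of-envelope conversion. The sketch below assumes the commonly used factor of roughly 2.13 gigatonnes of carbon per part per million of atmospheric CO₂, together with illustrative round concentrations of 280 p.p.m. (warm periods before industrialization) and 190 p.p.m. (glacial periods). None of these three numbers is given in the article.

```python
# Approximate conversion factor (assumed): gigatonnes of carbon per p.p.m. of CO2.
GTC_PER_PPM = 2.13

# Illustrative round concentrations (assumed), in parts per million.
INTERGLACIAL_PPM = 280.0
GLACIAL_PPM = 190.0

drawdown_ppm = INTERGLACIAL_PPM - GLACIAL_PPM
drawdown_gtc = drawdown_ppm * GTC_PER_PPM
fraction = drawdown_ppm / INTERGLACIAL_PPM

print(f"drawdown: {drawdown_ppm:.0f} p.p.m. = {drawdown_gtc:.0f} Gt of carbon")
print(f"fraction of atmospheric CO2 removed: {fraction:.0%}")
```

Under these assumptions the drawdown is a little under 200 gigatonnes of carbon, or nearly one-third of the atmospheric inventory, in line with the figures in the text.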

Two hundred gigatonnes is a lot of carbon
to periodically withdraw from and release to
the atmosphere. In the 1980s, a handful of
models (see ref. 3, for example) had shown
that an increase in biomass production in polar
ocean regions was the most effective process
for removing so much atmospheric carbon.
Photosynthetic organisms in the surface ocean
convert CO₂ from the atmosphere into biomass,
much of which is subsequently broken
down into CO₂ again by other organisms and
returned to the atmosphere. But part of the
biomass sinks into the deep ocean, which
therefore effectively serves as a large storage
reservoir of dissolved CO₂. This mechanism
of CO₂ removal is called the biological pump.

However, biomass production requires not 
only CO₂, but also other nutrients to build
lipids, proteins and enzymes. Researchers 
were struggling to ascertain how the ocean’s 
abundance of key nutrients, such as nitrates 
or phosphates, might have increased dur- 
ing glacial times to fuel a stronger biological 
pump. 

Martin argued that iron is another nutri- 
ent that limits the biological pump. He sug- 
gested that the modern marine ecosystem 
of the Southern Ocean around Antarctica is
starved of iron, and therefore relatively low in
biomass, despite having abundant nitrates and 
phosphates. But during glacial times, strong 
winds over cold, sparsely vegetated conti- 
nents could have transported large amounts 
of iron-bearing dust into this ocean (Fig. 1b). 
Martin reasoned that this dust could have fer- 
tilized marine ecosystems and strengthened 
the biological pump, so that more carbon was 
transferred into the deep ocean, lowering 
atmospheric CO₂ levels.

Around the time of publication, evidence for 
high dust delivery during glacial periods had 
just emerged from studies of deep Antarctic 
ice cores. But there were no reliable measurements
of dissolved iron in the Southern Ocean
that could confirm that its surface waters are 
iron-starved in modern times, or data support- 
ing the proposal that delivery of iron-rich dust 
would make a difference to ocean productiv- 
ity. It was clear, however, that large patches of 
the world’s ocean had much lower quantities 
of biomass than would be expected on the 
basis of the concentrations of key nutrients 
such as nitrates and phosphates. But many 
researchers argued that this was due to natural 
overgrazing of algae by herbivores.

The idea that modern algal growth 
is limited by iron availability had, in fact, 
been proposed’ in the 1930s, but had been 
incorrectly discounted by oceanographers — 
who had measured plenty of iron in seawater 
samples collected from the waters around 
their iron ships’. Martin was one of the first 
oceanographers to implement painstaking 
procedures to avoid the contamination of 
samples and to determine that iron con- 
centrations in the north Pacific Ocean were 
extremely low’, certainly low enough to curtail 
biomass production. 

Despite the initial scepticism that greeted 
the iron hypothesis, 12 separate experiments® 
were carried out between 1993 and 2005 in 
which around 300-3,000 kilograms of dis- 
solved iron were injected into small patches 
of the Southern Ocean, the equatorial Pacific 
Ocean and the north Pacific. The biomass of
algae increased wherever iron was added, as
biological production surged.

Figure 1 | The anti-correlated data that inspired the iron hypothesis. a, Measurements of air bubbles
trapped in cores drilled from the Antarctic ice sheet show that atmospheric levels of carbon dioxide were
significantly lower during the coldest periods (shaded regions) than during modern times (data from
ref. 16; CO₂ concentrations are shown in parts per million by volume; p.p.m.v.). b, The ice-core records
also reveal that more iron was transported to the Southern Ocean in wind-blown dust during the coldest
periods than during warmer times (data from ref. 17; iron flux is measured in micrograms per square metre
per year). In 1990, Martin hypothesized that the increased levels of iron in the Southern Ocean during the
coldest periods fertilized the growth of photosynthetic microorganisms in the surface Southern Ocean,
which therefore produced more biomass from CO₂. This, in turn, would have increased the strength of the
biological pump, a mechanism that sequesters some of the biomass (and the carbon within it) in the deep
ocean. Martin proposed that the stronger biological pump explains why so much atmospheric CO₂ is drawn
into the ocean during cold times.

Unfortunately, Martin died mere months 
before the first of these experiments, and 
did not witness the ocean-scale confirmation 
of his hypothesis, nor the internationally 
coordinated campaign to measure iron geochemistry
throughout the world’s oceans,
which confirmed iron limitation and revealed
the intricate strategies used by marine 
ecosystems to acquire and recycle iron”. 

Earth scientists also tried to test the iron 
hypothesis computationally using simple 
ocean models. They used the changes in the 
dust-accumulation rate recorded in ice cores 
as input to simulate changes in iron delivery 
to the Southern Ocean, and data from the 
experimental iron fertilizations to calculate 
how this iron could affect algal growth and the 
biological pump. Such models could repro- 
duce the timing and magnitude of about half 
of the observed decrease in atmospheric CO₂
levels during glacial periods. Iron fertilization
is therefore clearly an important process that
causes atmospheric changes, but might not 
be the only one. 

Finding data to prove that biological 
production had been higher during glacials
was a harder task; after all, the ecosystem
during the most recent glacial period (about 
20,000 years ago) is long dead. One possible 
solution was to extract cores from sedi- 
ments piled on the sea floor, to see whether 
the mineral skeletons of algae accumulated 
faster during glacial times than in the modern
era. However, the results were often ambigu- 
ous”, for several reasons: many algae don’t 
produce a preservable skeleton; numerous 
factors determine what proportion of biological
remains is preserved on the sea floor; and
the location of biological production changes 
through time as ocean fronts and sea-ice 
positions migrate. 

Fortunately, Martin and others had
anticipated an alternative, global-scale test 
of the biological pump during glacial times. 
If more biomass reached the deep ocean dur- 
ing glacials, then deep-sea microorganisms 
would use up more oxygen as they consumed 
it, decreasing the concentration of oxygen in
deep waters. Evidence of deep-ocean oxygen 
depletion would therefore be indicative of a 
strong biological pump. 

Martin recognized that the presence of 
certain microfossils in glacial-age sediments 
meant that the deep ocean had not become 
completely devoid of oxygen during glacials. 
But although this evidence crudely constrained
estimates of the degree to which iron fertili- 
zation might have enhanced productivity 
during glacials, it could not be used to deter- 
mine whether levels of deep-ocean oxygen 
were lower than during modern times. Since 
then, analysis of more-sensitive geochemical 
records indicates that the oxygen concentra- 
tion in bottom waters did decrease during 
glacial times. This provides the strongest 
confirmation yet of the large-scale accumulation
of carbon in the deep ocean during glacial
periods owing to a stronger biological pump. 

Slower rates of mixing between the deep and 
shallow oceans could also have enhanced the 
biological pump during glacials. The latest 
generation of climate models in which the 
ocean and atmosphere are coupled can test 
the contribution of the multiple processes 
that could have resulted in a reduction in 
bottom-water oxygen levels. Such models 
indicate that mixing rates can account for 
only half of the observed deep-ocean storage 
of CO₂ during the glacial period, and that iron
fertilization of the Southern Ocean is the major
cause of the extra CO₂ storage observed.

Martin concluded his paper by saying 
that iron availability “appears to have been a 
player” in strengthening the biological pump 
during glacial cycles, but that the size of its 
role remained to be determined. Thirty years 
later, the evidence convincingly shows that 
iron fertilization of the Southern Ocean was 
indeed a leading actor in this global-climate 
feedback. 

News & views


Cell biology 


Transfer of ubiquitin 
protein is caught in the act 


Raymond J. Deshaies & Nathan W. Pierce 


A process termed ubiquitination mediates the regulated
destruction of cellular proteins, thereby preventing disease or 
infection. Structural data now reveal how a crucial regulator of 
ubiquitination enzymes coordinates this process. See p.461 


Many cellular functions that occur in eukary- 
otes (organisms whose cells contain a nucleus) 
are regulated by targeted protein destruction. 
This targeting is often achieved by a process 
called ubiquitination (or ubiquitylation), in 
which a protein selected for destruction is 
tagged with the protein ubiquitin. Ubiquitination
is aided by enzymes known as E3 ligases,
a subset of which are called cullin–RING ubiquitin
ligases (CRLs). CRLs help to transfer
ubiquitin from an E2 conjugating enzyme, to
which it is bound, onto the target protein. By
default, CRLs are inactive, and they are activated
when a protein called NEDD8 (which
is similar in sequence to ubiquitin) becomes
attached to the cullin subunit of the CRL.
But how this activation happens has been a
mystery. On page 461, Baek et al. report structural
data obtained using a technique called
cryo-electron microscopy (cryo-EM) that fills
in some of the blanks.


CRLs contain a banana-shaped cullin 
subunit (one of five cullin proteins, CUL1 to
CUL5). This binds (Fig. 1) at one end to a substrate-receptor
subunit, which recruits the
protein targeted for ubiquitination, and at
the other end to what is termed a RING-finger
protein, which is either RBX1 or RBX2 (refs 1,7).
The RING-finger protein recruits a ubiqui- 
tin-attached E2 enzyme and stimulates the 
transfer of its ubiquitin to the target protein’. 
Previous structural analysis demonstrated that 
the attachment of NEDD8 to CUL5 enhances
the potential of RBX2 and its ubiquitin-bound 
E2 enzyme to move towards the region adja- 
cent to the substrate receptor and its bound 
target protein. However, that work used a
truncated version of CUL5 bound to RBX2,
and lacked both a target protein bound to the
substrate receptor and a ubiquitin-attached 
E2 enzyme, thus leaving to the imagination 
the mechanism by which NEDD8 stimulates 
ubiquitin transfer to the target protein.

Figure 1 | Structural basis for how ubiquitination is stimulated by the NEDD8 protein. Baek et al.
used cryo-electron microscopy to analyse how the ubiquitin protein (Ub) becomes attached to a protein
that is thereby marked for degradation. a, Ubiquitin binds to the enzyme UBE2D. The protein IκB is a
ubiquitination target, and binds to a substrate receptor called β-TRCP. This receptor also binds to a protein
complex consisting of SKP1–CUL1–RBX1, called CRL1, to form a complex termed CRL1^β-TRCP. The transfer
of ubiquitin from UBE2D to IκB is aided by CRL1^β-TRCP. The NEDD8 protein tags the WHB domain of CUL1,
thereby increasing the flexibility of the complex and enhancing ubiquitin transfer. b, The authors describe
a transition-state complex consisting of three modules: an activation module (the NEDD8-bound WHB
domain), a catalytic module (ubiquitin, UBE2D and the adjacent part of RBX1) and a substrate-scaffolding
module (the remaining components). They report that extensive rearrangements of these modules
occur after NEDD8 binds to CRL1, a finding that helps to explain how NEDD8 enhances ubiquitination.
(Image based on Extended Data Fig. 2 of ref. 6.)

Baek and colleagues therefore sought to 
capture a human NEDD8-attached CRL in the
act of transferring ubiquitin to its target pro- 
tein. To achieve this goal, the authors made 
a ‘tribrid’ molecule comprising three com- 
ponents. One component was a stretch of 
amino-acid residues derived from the protein 
IκB, which is a ubiquitination target that binds
to a substrate receptor called β-TRCP. The
second was an E2 enzyme termed UBE2D, and 
the third was ubiquitin. This tribrid provided 
a stable mimic of how the molecular compo- 
nents are arranged during the transition state, 
when ubiquitin is being transferred from the E2 
enzyme to the target protein. Using cryo-EM, 
the authors obtained structural data for the 
complex that formed when the tribrid and 
β-TRCP assembled with the proteins CUL1, SKP1
and RBX1 (this complex is called CRL1^β-TRCP).

The type of structural information that can 
be obtained using X-ray crystallography is con- 
strained by technical issues (packing forces 
in the crystals) that affect data collection. 
The cryo-EM approach taken by the authors 
avoids these constraints and enables multiple 
conformations of a structure to be obtained. 
The authors confirmed an earlier finding that
CRL1^β-TRCP shows modest conformational flexibility
in the absence of NEDD8, but that this
flexibility increases when NEDD8 is attached. 
Furthermore, on addition of the tribrid, 
the ensemble of conformations converged 
to form one structure of a transition-state 
intermediate. 

Baek and colleagues’ structural data are 
nothing short of spectacular. Previous work
suggested that the active site of the E2 enzyme,
where ubiquitin is transferred to the target 
protein, might not be fixed in location rel- 
ative to the NEDD8-attached CRL because 
of the mobility of the RING-finger protein’s 
RING domain. The transition state presented 
by Baek et al. shows the precise 3D relationship
of three modules that form the whole complex: 
a catalytic module, an activation module and
a substrate-scaffolding module. The catalytic
module comprises ubiquitin-bound UBE2D 
and the RING domain of RBX1, and this mod- 
ule moves when NEDD8 becomes attached 
to CUL1. The activation module consists of 
a mobile domain in CUL1 called the WHB 
domain, to which NEDD8 attaches. The substrate-scaffolding
module includes β-TRCP
and portions of CUL1 and RBX1 that have a
fixed spatial relationship to β-TRCP and IκB.

In Baek and colleagues’ proposed activated 
structure, the catalytic module projects 
directly towards the substrate-scaffolding 
module, such that UBE2D touches β-TRCP
(Fig. 1). The activation module coordinates 
the architecture of the transition state, with 
NEDD8 forming multiple contacts between 
UBE2D in the catalytic module and CUL1 in 
the substrate-scaffolding module. These
interactions stabilize the configurations of
the WHB and RING domains and bring UBE2D’s
active site into close proximity with β-TRCP
and its bound target protein. 

To confirm these findings, the authors 
performed extensive and sophisticated kinetic 
analyses comparing wild-type complexes with 
those containing mutant proteins designed to 
disrupt interactions between the modules. All 
complexes containing a single mutant protein 
showed strongly reduced enzymatic activity 
compared with those of wild-type complexes, 
and complexes containing two mutant pro- 
teins had potent synergistic defects, which 
is consistent with the authors’ model for how 
the complex functions. 

This structure provides information that 
explains many previously confusing or 
contradictory observations. For example, 
it now makes sense why, during a bacterial 
infection, there is a catastrophic effect on CRL
function when bacterial enzymes target the 
glutamine 40 amino-acid residue of NEDD8 
(ref. 8). This is because modification of this 
residue would destabilize the activation mod- 
ule. In addition, the structure shows clearly 
how direct contacts between NEDD8 and 
UBE2D that occur away from UBE2D’s catalytic
site work together with RBX1 to optimally
position the catalytic module relative to the
β-TRCP-bound target protein.


These structural insights pose new 
questions. Most notably, why does the transfer 
of the first ubiquitin to some CRL substrates 
require an extra RBX1-interacting complex of 
E3 and E2 enzymes (ARIH1 and UBE2L3, respec- 
tively’), given the extraordinary catalytic 
efficiency of the complex reported by Baek 
and colleagues? Moreover, how is the observed
rapidity of ubiquitin transfer achieved, given 
the proposed requirement for the complex 
to undergo substantial structural rearrange- 
ments to reach the transition state? And what 
might the transition state look like for the 
NEDD8-stimulated process of chain elongation 
(the attachment of further ubiquitin molecules 
to the initial ubiquitin tag on a target protein),
considering that ubiquitin-chain elongation is 
mediated by different E2 enzymes" from those 
that add the initial ubiquitin tag? With cryo-EM 
now firmly part of the toolkit for investigat- 
ing ubiquitination, the answers might arrive 
sooner than we thought. 

Importantly, these new structural data 
might help in the design of drugs known as proteolysis-targeting
chimaeras (PROTACs), some
of which can redirect specific CRL enzymes 
to ubiquitinate and thus destroy targets of 
clinical interest that are outside the enzymes’ 
natural repertoire’. These drugs work by 
simultaneously binding substrate receptors 
of CRLs and a target protein. However, the
formation of such complexes is not always
sufficient to stimulate ubiquitin transfer”. 
The reason for this might become clear from 
the deeper understanding of CRL-mediated 
ubiquitin transfer gained through the work 
of Baek and colleagues. 

At the historic Shelter Island Conference on the Foundations of Quantum Mechanics
in 1947, Willis Lamb reported an unexpected feature in the fine structure of atomic
hydrogen: a separation of the 2S₁/₂ and 2P₁/₂ states. The observation of this separation,
now known as the Lamb shift, marked an important event in the evolution of modern
physics, inspiring others to develop the theory of quantum electrodynamics.
Quantum electrodynamics also describes antimatter, but it has only recently become
possible to synthesize and trap atomic antimatter to probe its structure. Mirroring the
historical development of quantum atomic physics in the twentieth century, modern
measurements on anti-atoms represent a unique approach for testing quantum
electrodynamics and the foundational symmetries of the standard model. Here we
report measurements of the fine structure in the n = 2 states of antihydrogen, the
antimatter counterpart of the hydrogen atom. Using optical excitation of the 1S–2P
Lyman-α transitions in antihydrogen, we determine their frequencies in a magnetic
field of 1 tesla to a precision of 16 parts per billion. Assuming the standard Zeeman and
hyperfine interactions, we infer the zero-field fine-structure splitting (2P₁/₂–2P₃/₂) in
antihydrogen. The resulting value is consistent with the predictions of quantum
electrodynamics to a precision of 2 per cent. Using our previously measured value of
the 1S–2S transition frequency, we find that the classic Lamb shift in antihydrogen
(2S₁/₂–2P₁/₂ splitting at zero field) is consistent with theory at a level of 11 per cent. Our
observations represent an important step towards precision measurements of the
fine structure and the Lamb shift in the antihydrogen spectrum as tests of the charge-
parity-time symmetry and towards the determination of other fundamental
quantities, such as the antiproton charge radius, in this antimatter system.


The fine-structure splitting of the n = 2 states of hydrogen is the separation
of the 2P_{3/2} and 2P_{1/2} levels at zero magnetic field. This splitting,
predicted by the Dirac theory of relativistic quantum mechanics, originates
from the spin–orbit interaction between the non-zero orbital
angular momentum (L = 1) and the electron spin. The 'classic' Lamb shift
is defined as the splitting between the 2S_{1/2} and 2P_{1/2} states at zero field,
and is a manifestation of the interaction of the electron with the quantum
fluctuations of the vacuum electromagnetic field, an effect explained
by quantum electrodynamics (QED). Today, it is understood that
the classic Lamb shift in hydrogen is dominated by the QED effects on
the 2S energy level, and that the 1S level receives even stronger QED
corrections than the 2S level. Although QED corrections in levels
n ≠ 2 are now also sometimes referred to as Lamb shifts, in this Article
we restrict our definition of the Lamb shift to be the classic n = 2 shift.
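
As a rough orientation for the size of the fine-structure splitting defined above, the following sketch evaluates the leading-order (order α²) Dirac correction for the n = 2 levels. It is an illustration only, not the calculation used in this work: the constants are CODATA values supplied here as assumptions, and reduced-mass, anomalous-magnetic-moment and QED corrections are neglected.

```python
# Leading-order Dirac estimate of the n = 2 fine-structure splitting.
# Constants are CODATA 2018 values (assumed here, not taken from the article).
ALPHA = 7.2973525693e-3           # fine-structure constant
RYDBERG_HZ = 3.2898419602508e15   # Rydberg frequency c * R_infinity, in Hz


def dirac_shift_hz(n: int, j: float) -> float:
    """Order-alpha^2 correction to the Bohr level with quantum numbers (n, j), in Hz."""
    return -(RYDBERG_HZ / n**2) * (ALPHA**2 / n**2) * (n / (j + 0.5) - 0.75)


def n2_splitting_hz() -> float:
    """Separation of the 2P_{3/2} and 2P_{1/2} levels at zero field, in Hz."""
    return dirac_shift_hz(2, 1.5) - dirac_shift_hz(2, 0.5)


if __name__ == "__main__":
    print(f"n = 2 fine-structure splitting: {n2_splitting_hz() / 1e9:.3f} GHz")
```

At this order the splitting reduces to α²·Ry/16, which is roughly 10.95 GHz.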

In a magnetic field, the Zeeman effect causes the 2P_{3/2} state to also
split into four sublevels (labelled 2P_a, 2P_b, 2P_c and 2P_d), whereas the 2S_{1/2}
and 2P_{1/2} states each split into two (2S_{ab} and 2S_{cd}; 2P_e and 2P_f). These
fine-structure levels further split into two hyperfine states owing to
the proton spin (see Fig. 1 for the expected energy levels for the case
of antihydrogen, where the spin orientations are reversed with respect
to those of hydrogen).


Lamb's original work used the then newly developed techniques of
an excited-state atomic hydrogen beam and resonant microwave
spectroscopy to study direct transitions between the n = 2 fine-structure
states in various magnetic fields. The Lamb shift was then determined
to 10% precision by extrapolating frequency measurements to zero
field. Here, we report the observation of the splitting between the 2P_c
and 2P_f states in antihydrogen in a field of 1 T, by studying laser-induced
transitions from the ground state. Assuming the validity of the Zeeman
and hyperfine interactions, and using the value of the previously
measured 1S–2S transition frequency, we infer from our results the values
of the zero-field fine-structure splitting and the classic Lamb shift in
antihydrogen. Such studies have become possible owing to the
combination of several recent advances: the accumulation of hundreds
of anti-atoms in each run, their confinement for many hours, control
of the hyperfine polarization of the antihydrogen samples and the
development of a narrow-line, pulsed, Lyman-α laser.

Details of the production, trapping and control of antihydrogen in
the ALPHA experiment have been provided elsewhere, so the following
description is brief. The ALPHA-2 apparatus (Fig. 2) incorporates
a cylindrical magnetic trapping volume (about 400 cm³) for neutral
anti-atoms; the magnetic-field minimum at the centre of the trap was


*A list of participants and their affiliations appears at the end of the paper 
Fig. 1 | Expected antihydrogen energy levels. Calculated energies of the fine
structure and the hyperfine sublevels of the 1S_{1/2}, 2S_{1/2}, 2P_{3/2} and 2P_{1/2} states are
shown as functions of magnetic-field strength. The spin orientations for
antihydrogen are shown; they are reversed for hydrogen. The centroid energy
difference, E_{1S–2S} = 2.4661 × 10^15 Hz, has been suppressed on the vertical axis.
Details of the energy levels relevant to this work at a magnetic field of
B = 1.0329 T are shown on the right. Each state is labelled using conventional
notation. For the 1S and 2S states, the hyperfine states are labelled with
subscripts a–d in order of increasing energy; namely,
S_a = |↑⇓⟩, S_b = |↑⇑⟩, S_c = |↓⇑⟩ and S_d = |↓⇓⟩, where the ket notation represents the
positron spin (left; ↓ or ↑) and antiproton spin (right; ⇓ or ⇑) states in the
high-field limit. The labels S_{ab} and S_{cd} are used when the antiproton spins are
unpolarized. For the 2P states, the fine-structure splittings are labelled with


set to 1.0329 ± 0.0004 T for this work. (All uncertainties given herein
are 1σ.) By combining 90,000 trapped antiprotons from the CERN
Antiproton Decelerator and three million positrons from a positron
accumulator, about 10–30 cold (below 0.54 K) anti-atoms are confined
in the magnetic trap in a 4-min cycle. Under normal conditions,
the storage lifetime of the trapped antihydrogen is greater than 60 h,
which permits loading from repeated cycles to obtain hundreds of
antihydrogen atoms in a few hours.

Two types of antihydrogen samples were used in these studies. The
positron spin of an antihydrogen atom confined in the ALPHA-2 trap is
necessarily polarized, because only the 1S_c and 1S_d states can be
magnetically trapped (Fig. 1). The antiproton spin, on the other hand, is
unpolarized a priori, with both orientations equally likely. Thus, the
initial samples are singly spin-polarized. By contrast, doubly
spin-polarized samples, in which both the positron and antiproton
spins are polarized, can be prepared by injecting microwaves to
resonantly drive the 1S_c atoms to the untrappable 1S_b state (Fig. 1), effectively
depopulating the 1S_c state from the trap.

Spectroscopy in the vacuum ultraviolet range is challenging even for 
ordinary atoms, owing in part to the lack of convenient laser sources and 
optical components. Our pulsed, coherent 121.6-nm radiation was
produced by generating the third harmonic of 365-nm pulses in a Kr/Ar 
gas mixture at a repetition rate of 10 Hz (ref. 8). The typical pulse width 
subscripts a–f in order of decreasing energy at low magnetic fields, whereas the
hyperfine splitting due to the antiproton spin is specified by subscripts + and −
for spin parallel (⇑) and anti-parallel (⇓) to the magnetic field in the high-field
limit, respectively. The symbol (↓,↑) in the figure indicates that the positron
spin states are mixed for the 2P_c and 2P_f states. The vertical solid arrows
indicate the one-photon laser transitions probed here: 1S_d → 2P_{f−} (bold red),
1S_c → 2P_{f+} (thin red), 1S_d → 2P_{c−} (bold blue) and 1S_c → 2P_{c+} (thin blue). The dashed
red and blue arrows indicate relaxation to the same trappable level, which is not
detectable in the present experiment, and the dashed black arrows indicate
relaxation to untrappable levels, which is detectable via annihilation signals
(see text). The bold black arrow shows the microwave transition used to
eliminate 1S_c state atoms to prepare a doubly spin-polarized antihydrogen
sample.


at 121.6 nm was 12 ns, and the bandwidth was estimated from the Fourier
transform of the temporal pulse shape to be 65 MHz (full-width at
half-maximum, FWHM). The 121.6-nm light was linearly polarized because
of the three-photon mixing of linearly polarized 365-nm light. In the
antihydrogen trap, the polarization vector was nearly perpendicular to
the direction of the axial magnetic field. The laser beam had a radius of
3.6 mm and was roughly collimated across the trapping region (Fig. 2). The
average pulse energies in the antihydrogen trapping volume ranged from
0.44 nJ to 0.72 nJ over different runs, as evaluated from the pulse
waveforms recorded with a calibrated, solar-blind photomultiplier detector.

In this experiment, single-photon transitions from the 1S_c (1S_d) states
to the 2P_{c+} (2P_{c−}) and 2P_{f+} (2P_{f−}) states are driven by the 121.6-nm light
(red and blue arrows in Fig. 1). When antihydrogen is excited to the
2P_{c±} or 2P_{f±} state, it decays to the ground-state manifold within a few
nanoseconds by emitting a photon at 121.6 nm. The mixed nature of the
positron spin states in the 2P_{c+} (2P_{c−}) and 2P_{f+} (2P_{f−}) states implies that
these states can decay to the 1S_b (1S_a) states via a positron spin flip (black
dashed arrows in Fig. 1). Atoms in these final states are expelled from
the trap and are annihilated on the trap walls. Annihilation products
(charged pions) are in turn detected by a silicon vertex detector with
an efficiency greater than 80%.

Table 1 summarizes our data. In total, four series of measurements 
were performed using either singly or doubly spin-polarized samples. 


[Fig. 2 schematic. Labelled components: solenoids, electrodes, mirror coils, octupole, 365-nm input, THG cell, MgF₂ windows, 121.6-nm beam, microwaves, PMT. Labelled regions: antiproton preparation; antihydrogen synthesis and trapping. Legend: air, vacuum, liquid helium.]

Fig. 2| The ALPHA-2 central apparatus. A cylindrical trapping volume for 
neutral antimatter with a diameter of 44.35 mm and an axial length of 280 mm 
is located inside several Penning trap electrodes and surrounded by an 
octupole coil, five mirror coils and two solenoids, all superconducting. The 
three-layer silicon vertex annihilation detector is shown schematically in 
green. Laser light (purple line) enters from the positron (e⁺) side (right) and is
transmitted to the antiproton (p̄) side (left) through vacuum-ultraviolet-grade


The Series 1 data, previously reported in ref. 6, have been reanalysed.
Each series consisted of two or four runs, and in each run about 500
antihydrogen atoms were accumulated over approximately two hours,
typically involving over 30 production cycles. The trapped anti-atoms were
then irradiated for about two hours by a total of 72,000 laser pulses at
twelve different frequencies (that is, 6,000 pulses per frequency point
for each run) spanning the range −3.10 GHz to +2.12 GHz relative to the
expected (hydrogen) transition frequencies. The laser frequency was
changed every 20 s in a non-monotonic fashion to minimize effects
related to the depletion of the sample of antihydrogen. After the laser
exposure, the remaining antihydrogen atoms were released by shutting
down the trap magnets, typically in 15 s, and counted via detection of
their annihilation events. Between 40% and 60% of the trapped antihydrogen
atoms experienced resonant, laser-induced spin flips, and their annihilations
were detected during the two-hour laser irradiation period.
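
The irradiation schedule quoted above can be cross-checked with a few lines of arithmetic. This is a sketch under the assumption that the laser fires continuously at the 10-Hz repetition rate stated earlier, with no dead time when the frequency is changed; the variable names are illustrative.

```python
# Cross-check of the per-run laser irradiation schedule.
# Assumes continuous firing at the stated repetition rate (no dead time).
PULSES_PER_FREQUENCY = 6_000   # pulses per frequency point in each run
N_FREQUENCIES = 12             # frequency points per run
REPETITION_RATE_HZ = 10        # laser repetition rate
DWELL_S = 20                   # time spent at one frequency before switching

total_pulses = PULSES_PER_FREQUENCY * N_FREQUENCIES
irradiation_hours = total_pulses / REPETITION_RATE_HZ / 3600
pulses_per_dwell = REPETITION_RATE_HZ * DWELL_S
visits_per_frequency = PULSES_PER_FREQUENCY // pulses_per_dwell

if __name__ == "__main__":
    print(f"total pulses per run:        {total_pulses}")
    print(f"irradiation time per run:    {irradiation_hours:.1f} h")
    print(f"pulses per 20-s dwell:       {pulses_per_dwell}")
    print(f"visits to each frequency:    {visits_per_frequency}")
```

Under these assumptions each frequency point is revisited about 30 times per run, which is what allows the non-monotonic ordering to average over the gradual depletion of the sample.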

A combination of time-gated antihydrogen detection (enabled by
the use of a pulsed laser), the accumulation of a large number of
anti-atoms and the use of supervised machine-learning analysis (based
on a boosted decision-tree classifier) suppressed the background to a
negligible level (less than 2 counts per 2-h irradiation period).

The measured spectra, obtained from counting the laser-induced
spin-flip events, are shown in Fig. 3 for both singly and doubly
spin-polarized antihydrogen samples. For each run, the probability at each
frequency point is determined by dividing the number of annihilation
events recorded at that frequency by the total number of trapped
atoms in that run, and further dividing by the ratio of the average laser
energy to a standard value of 0.5 nJ. The normalization to the standard
laser energy accounts for the expected linear dependence of the
transition probability on the laser power in our regime. The data plotted
in Fig. 3 are spectrum-averaged over the runs for each series. For the
singly polarized sample (Fig. 3a), each transition shows a linewidth of



MgF₂ ultrahigh-vacuum windows. The laser beam crosses the trap axis at an
angle of 2.3°. The transmitted 121.6-nm pulses are detected by a solar-blind
photomultiplier tube (PMT) at the antiproton side. Microwaves used to prepare 
the doubly spin-polarized samples are introduced from the positron side 
through a waveguide, shown in blue. The external solenoid magnet for the 
Penning traps is not shown here. THG, third-harmonic generation. 


about 1.5 GHz (FWHM). This is consistent with the expected Doppler
broadening in our trapping condition (1 GHz FWHM) and the hyperfine
splitting of the 1S–2P_c and 1S–2P_f transitions (0.71 GHz for both
transitions). The hyperfine structure cannot be resolved in these singly
polarized samples owing to the Doppler broadening.
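
The per-run normalization described for the spectra of Fig. 3 (events divided by trapped atoms, then by the ratio of the average pulse energy to 0.5 nJ) can be written as a small helper. This is a minimal sketch; the function name and the example numbers are hypothetical and are not taken from the data.

```python
# Sketch of the spectrum normalization: spin-flip probability per trapped atom,
# scaled to a standard laser pulse energy of 0.5 nJ (linear-response assumption).
STANDARD_ENERGY_NJ = 0.5


def normalized_probability(events: float, trapped_atoms: float,
                           mean_pulse_energy_nj: float) -> float:
    """Return the transition probability at one frequency point of one run."""
    if trapped_atoms <= 0 or mean_pulse_energy_nj <= 0:
        raise ValueError("trapped_atoms and mean_pulse_energy_nj must be positive")
    raw_probability = events / trapped_atoms
    energy_ratio = mean_pulse_energy_nj / STANDARD_ENERGY_NJ
    return raw_probability / energy_ratio


if __name__ == "__main__":
    # Hypothetical example: 30 events from 500 trapped atoms at 0.6 nJ per pulse.
    print(f"{normalized_probability(30, 500, 0.6):.3f}")
```

Dividing by the energy ratio only makes sense while the transition probability is linear in laser power, which is the regime assumed in the text.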

Figure 3b shows the spectra obtained from doubly spin-polarized
antihydrogen samples. For these data, microwave radiation of ~28 GHz
(power ~0.4 W, measured at the trap entrance) was applied before the
start of optical spectroscopy, in the form of a 9-MHz sweep, covering
the 1S_c–1S_b transition in the magnetic-field minimum. As shown in
Table 1, about half of the total trapped antihydrogen atoms underwent
a positron spin flip and annihilated during microwave irradiation. This
is consistent with our experience from earlier studies, in which
1S_c-state atoms were removed with about 95% efficiency. The spectral
lines of the 1S–2P transitions in doubly spin-polarized antihydrogen
(Fig. 3b) are narrower than those in the singly spin-polarized samples
(Fig. 3a) because the former involve only one hyperfine state in the
ground state. The peaks are red-shifted because the frequencies of the
transitions from the 1S_d state to the 2P_c and 2P_f states are expected to
be about 700 MHz lower than those from the 1S_c state. The observed
width of ~1 GHz FWHM of these lines is in agreement with the Doppler
width expected for our trapping conditions.
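
For orientation, the scale of the Doppler width can be estimated from the textbook expression for a thermal gas, FWHM = f₀·sqrt(8 ln2 · k_B·T / (m·c²)). The sketch below is only an order-of-magnitude check: trapped antihydrogen is not a thermal ensemble, the temperature of 0.3 K is an illustrative assumption, and the expected width quoted in the text comes from the simulation described in Methods, not from this formula.

```python
import math

# Thermal (Maxwell-Boltzmann) Doppler width estimate for the Lyman-alpha line.
# All inputs below are assumptions made for illustration.
K_B = 1.380649e-23        # Boltzmann constant, J/K
C = 299_792_458.0         # speed of light, m/s
M_H = 1.6735e-27          # hydrogen atom mass in kg, assumed equal for antihydrogen
F0_HZ = 2.466e15          # approximate 1S-2P transition frequency, Hz


def doppler_fwhm_hz(temperature_k: float, f0_hz: float = F0_HZ,
                    mass_kg: float = M_H) -> float:
    """Full width at half maximum of a thermally Doppler-broadened line, in Hz."""
    return f0_hz * math.sqrt(8 * math.log(2) * K_B * temperature_k / (mass_kg * C**2))


if __name__ == "__main__":
    for t in (0.1, 0.3, 0.54):
        print(f"T = {t:.2f} K -> FWHM = {doppler_fwhm_hz(t) / 1e9:.2f} GHz")
```

For temperatures of a few tenths of a kelvin the estimate is of order 1 GHz, the same scale as the observed linewidth.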

The procedure used to extract the frequencies of the fine-structure 
transitions and to evaluate their associated uncertainties is described 
in Methods. We summarize the results of this analysis in Table 2. A simulation
was used to model the motion of trapped antihydrogen atoms
in the ALPHA-2 trap and their interaction with pulsed laser radiation.
The resonance transition frequencies were obtained by comparing 
simulated and experimental lineshapes. Extensive investigations were 
performed to evaluate systematic uncertainties in our measurement 
(Table 3). The validity of our analysis procedure was tested by using 


Table 1 | Experimental parameters and number of detected events

Series  Sample        Transition          Number   Average pulse  Number of    Number of pulses  Number of      Microwave  Laser   Counts upon
        polarization  probed              of runs  energy (pJ)    frequencies  per frequency     trapped atoms  counts     counts  release
1       Single        1S_{c,d} → 2P_{c±}  4        600            12           24,000            2,004          –          1,197   807
2       Single        1S_{c,d} → 2P_{f±}  4        550            12           24,000            2,012          –          1,075   937
3       Double        1S_d → 2P_{c−}      2        440            12           12,000            1,044          527        229     288
4       Double        1S_d → 2P_{f−}      2        720            12           12,000            971            463        341     167

The experimental parameters, together with the number of antihydrogen events detected during the microwave irradiation, the laser irradiation and the release of the remaining atoms, are
tabulated for each series. The machine-learning analysis identifies annihilation events with an estimated efficiency of 0.849 for the microwave irradiation, 0.807 for the laser irradiation and
0.851 for the release of the remaining atoms. The number of counts is corrected for the detection efficiencies. The number of trapped atoms is derived from the sum of the other counts.
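
The bookkeeping stated in the note to Table 1 (trapped atoms as the sum of the efficiency-corrected counts) can be verified directly from the tabulated values. In this sketch the dictionary layout and function names are illustrative choices, the dash in the microwave column is treated as zero, and the raw count used in the efficiency example is hypothetical.

```python
# Consistency check of Table 1: trapped atoms = microwave + laser + release counts.
TABLE_1 = {
    1: {"trapped": 2004, "microwave": 0,   "laser": 1197, "release": 807},
    2: {"trapped": 2012, "microwave": 0,   "laser": 1075, "release": 937},
    3: {"trapped": 1044, "microwave": 527, "laser": 229,  "release": 288},
    4: {"trapped": 971,  "microwave": 463, "laser": 341,  "release": 167},
}


def summed_counts(row: dict) -> int:
    """Sum of the efficiency-corrected counts in one series."""
    return row["microwave"] + row["laser"] + row["release"]


def microwave_fraction(row: dict) -> float:
    """Fraction of trapped atoms removed during microwave irradiation."""
    return row["microwave"] / row["trapped"]


def efficiency_corrected(raw_counts: float, efficiency: float) -> float:
    """Scale a raw detected count by the detection efficiency."""
    return raw_counts / efficiency


if __name__ == "__main__":
    for series, row in TABLE_1.items():
        print(series, summed_counts(row) == row["trapped"],
              f"{microwave_fraction(row):.2f}")
    # Hypothetical raw count of 807 at the quoted laser-window efficiency of 0.807.
    print(round(efficiency_corrected(807, 0.807)))
```

For the doubly spin-polarized series the microwave fraction comes out close to one half, in line with removing one of the two equally populated trappable hyperfine states.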
Fig. 3 | 1S–2P fine-structure spectrum of antihydrogen. a, b, Experimental
data (filled circles) and fitted lineshapes for singly spin-polarized (a) and
doubly spin-polarized (b) antihydrogen samples. The data points were
obtained from the detected spin-flip events, normalized to the total number of
trapped antihydrogen atoms, for a laser pulse energy of 0.5 nJ. The error bars
are 1σ counting uncertainties. The frequency is offset by 2,466,036.3 GHz. We


different lineshape-fitting models. Two representative curve fits are
shown in Fig. 3. The fit of Model 1 uses a function constrained to fit the
simulation shape, whereas in Model 2 the shape parameters of this
function are allowed to vary to best fit the experimental data; see
Methods for details. The sensitivity of the results to the experimental and
simulation parameters was tested by repeating the analysis procedure
for a number of simulations with varied input. These included the
initial antihydrogen conditions (such as the initial temperature, the
quantum state, and the cloud diameter of antihydrogen at formation)
and laser properties (such as linewidth, beam waist size and beam
position); see Methods and Extended Data Fig. 1. Other sources of
systematic uncertainties include the calibration accuracy and a
possible frequency drift of the wavemeter, frequency shifts of the 730-nm
amplification laser cavity, and possible incomplete clearing of the 1S_c
state in the preparation of the doubly spin-polarized samples (Table 3
and Methods).

Within the uncertainties, the measured transition frequencies agree 
with theoretical expectations for hydrogen for all four series (Table 2, 
Fig. 4). The fact that the four measurements are consistent, despite hav- 
ing different systematics, increases the confidence in our overall results. 
The results can be combined to give a test of charge–parity–time (CPT)


Table 2 | 1S-2P transition frequencies 


note that no data were taken between the two peaks (~2–12 GHz). The red fit
curves were obtained via our standard fitting procedure (Model 1), and the blue 
curves were derived from an alternative fitting model (Model 2), illustrating the 
sensitivity of our results to the fitting procedure. See text and Methods for 
detailed discussion. 


invariance in the 1S-2P transitions at the level of 16 parts per billion 
(Fig. 4). 

Fundamental physical quantities of antihydrogen can be extracted
from our optical measurements of the 1S–2P transitions by combining
them with our earlier measurement of the 1S–2S transition in the
same magnetic trapping field. From the weighted average of the
results between the singly polarized and doubly polarized
measurements (Table 1), we obtain a 2P_{c−}–2P_{f−} splitting of 14.945 ± 0.075 GHz,
a 2S_d–2P_{c−} splitting of 9.832 ± 0.049 GHz and a 2S_d–2P_{f−} splitting of
24.778 ± 0.060 GHz at 1.0329 T (Methods). Only two of these three
splittings are independent, and they all agree with the values predicted
for hydrogen in the same field.

In interpreting our data, we categorize features in the spectrum based
on the order of the fine-structure constant α in a perturbative series
expansion in quantum field theory (which is assumed to be valid for the
purpose of our categorization). Those features that can be described by
the Dirac theory (the Zeeman, hyperfine and fine-structure effects) are
referred to as 'tree-level effects' and follow from the lower-order terms
(up to order ~α²Ry, where Ry is the Rydberg constant). On the other
hand, the Lamb shift originates from the so-called 'loop effects' (order
~α³Ry), the calculation of which requires the concept of renormalization


Transition          Sample spin   Antihydrogen        Hydrogen          Difference
                    polarization  f_res(exp) (MHz)    f_res(th) (MHz)   f_res(exp) − f_res(th) (MHz)
1S_{c,d} → 2P_{c±}  Single        2,466,051,659(62)   2,466,051,625     34
1S_{c,d} → 2P_{f±}  Single        2,466,036,611(88)   2,466,036,642     −31
1S_d → 2P_{c−}      Double        2,466,051,189(76)   2,466,051,270     −81
1S_d → 2P_{f−}      Double        2,466,036,395(81)   2,466,036,287     108

The experimentally determined transition frequencies for antihydrogen f_res(exp) (with 1σ errors in parentheses) are compared with the theoretically expected values for hydrogen f_res(th) at a
magnetic field of 1.0329 T. For the singly spin-polarized data, the centroid of the hyperfine states is given. The transition frequencies for hydrogen were calculated to a precision better than
1 MHz (Methods).
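
To see how the four entries of Table 2 translate into a parts-per-billion comparison, the sketch below forms a simple inverse-variance weighted mean of the differences. It deliberately treats the four uncertainties as independent, which is an assumption made here for illustration; the combined value quoted in the article accounts for correlated uncertainties (Methods), so the two numbers need not coincide exactly.

```python
import math

# Rows of Table 2: (label, f_exp in MHz, 1-sigma in MHz, f_theory in MHz).
TABLE_2 = [
    ("1S_cd -> 2P_c+-", 2_466_051_659, 62, 2_466_051_625),
    ("1S_cd -> 2P_f+-", 2_466_036_611, 88, 2_466_036_642),
    ("1S_d  -> 2P_c-",  2_466_051_189, 76, 2_466_051_270),
    ("1S_d  -> 2P_f-",  2_466_036_395, 81, 2_466_036_287),
]


def weighted_mean_difference(rows):
    """Inverse-variance weighted mean of (exp - theory) and its uncertainty, in MHz.

    Correlations between rows are ignored (illustrative simplification).
    """
    weights = [1.0 / sigma**2 for _, _, sigma, _ in rows]
    diffs = [f_exp - f_th for _, f_exp, _, f_th in rows]
    mean = sum(w * d for w, d in zip(weights, diffs)) / sum(weights)
    sigma_mean = 1.0 / math.sqrt(sum(weights))
    return mean, sigma_mean


def to_ppb(value_mhz: float, rows) -> float:
    """Express a frequency offset relative to the mean theoretical frequency, in ppb."""
    mean_frequency = sum(f_th for _, _, _, f_th in rows) / len(rows)
    return value_mhz / mean_frequency * 1e9


if __name__ == "__main__":
    mean, sigma = weighted_mean_difference(TABLE_2)
    print(f"weighted mean difference: {mean:.1f} +/- {sigma:.1f} MHz")
    print(f"relative uncertainty:     {to_ppb(sigma, TABLE_2):.1f} ppb")
```

Even this naive combination lands close to the quoted level of sensitivity, and the mean difference is well within its uncertainty of zero.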


378 | Nature | Vol578 | 20 February 2020 


Table 3 | Summary of uncertainties

Source of uncertainty                                     1S_d → 2P_{c−}   1S_d → 2P_{f−}   1S_{c,d} → 2P_{c±}   1S_{c,d} → 2P_{f±}
                                                          Doubly           Doubly           Singly               Singly
                                                          spin-polarized   spin-polarized   spin-polarized       spin-polarized
                                                          (MHz)            (MHz)            (MHz)                (MHz)
Lineshape fit statistics                                  55               54               45                   47
Fitting-model dependence                                  24               42               7                    62
Wavemeter drift                                           30               30               30                   30
Wavemeter offset                                          18               18               18                   18
730-nm cavity frequency correction                        18               18               18                   18
Residual 1S_c state atoms in doubly spin-polarized sample 23               16               (0)
Magnetic field                                            5                8                5                    8
Total                                                     76               81               62                   88

Estimated uncertainties (1σ) at 121.6 nm for each transition (Methods).
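
One way to read Table 3 is to combine the listed components in quadrature and compare with the quoted totals. The sketch below does this under two assumptions that are made here and not stated in the table itself: the components are treated as independent, and the residual-1S_c entry is taken as zero for both singly spin-polarized columns.

```python
import math

# Components per column, in MHz, in the row order of Table 3:
# fit statistics, model dependence, wavemeter drift, wavemeter offset,
# 730-nm cavity correction, residual 1S_c atoms, magnetic field.
COMPONENTS_MHZ = {
    "1S_d -> 2P_c- (double)":   [55, 24, 30, 18, 18, 23, 5],
    "1S_d -> 2P_f- (double)":   [54, 42, 30, 18, 18, 16, 8],
    "1S_cd -> 2P_c+- (single)": [45, 7, 30, 18, 18, 0, 5],   # residual assumed 0
    "1S_cd -> 2P_f+- (single)": [47, 62, 30, 18, 18, 0, 8],  # residual assumed 0
}
QUOTED_TOTAL_MHZ = {
    "1S_d -> 2P_c- (double)": 76,
    "1S_d -> 2P_f- (double)": 81,
    "1S_cd -> 2P_c+- (single)": 62,
    "1S_cd -> 2P_f+- (single)": 88,
}


def quadrature_sum(values) -> float:
    """Root-sum-of-squares of independent uncertainty components."""
    return math.sqrt(sum(v * v for v in values))


if __name__ == "__main__":
    for name, parts in COMPONENTS_MHZ.items():
        print(f"{name}: {quadrature_sum(parts):.1f} MHz "
              f"(quoted {QUOTED_TOTAL_MHZ[name]} MHz)")
```

Under these assumptions the quadrature sums agree with the quoted totals to within about 2 MHz; small differences may reflect rounding of the tabulated components.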


to avoid infinities. Each of the measured splittings has different
sensitivity to different terms. At the level of our precision, the 2P_c–2P_f
splitting is sensitive to the tree-level terms with negligible QED
effects, whereas the 2S–2P_c and 2S–2P_f splittings are sensitive to the
field-independent Lamb shift, in addition to the tree-level terms (we
note that the Lamb shift is predicted to have negligible dependence
on the magnetic field). The agreement between our measurement
and the Dirac prediction for the 2P_{c−}–2P_{f−} splitting supports the
consistency of the tree-level theory in describing the Zeeman, hyperfine
and fine-structure interactions in the 2P states of antihydrogen. If we
hence assume that we can correctly account for the tree-level effects
in our measurements, we can infer from our measured splittings the
value of the zero-field fine-structure splitting in antihydrogen to be
10.88 ± 0.19 GHz. By combining the current result with the much more
precisely measured 1S–2S transition frequency in antihydrogen, we
obtain a classic Lamb shift of 0.99 ± 0.11 GHz (Methods). If we use the
theoretical value of the fine-structure splitting from the Dirac
prediction (rather than treat it as a parameter), we can derive a tighter
constraint on the Lamb shift, 1.046 ± 0.035 GHz.
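
The percentage precisions quoted for these inferred quantities follow directly from the values and uncertainties above. A short sketch of that arithmetic, with illustrative names:

```python
# Relative (1-sigma) precision of the inferred zero-field quantities, in per cent.
RESULTS_GHZ = {
    "fine-structure splitting": (10.88, 0.19),
    "classic Lamb shift": (0.99, 0.11),
    "Lamb shift (Dirac fine structure assumed)": (1.046, 0.035),
}


def relative_uncertainty_percent(value: float, sigma: float) -> float:
    """Return 100 * sigma / value."""
    return 100.0 * sigma / value


if __name__ == "__main__":
    for name, (value, sigma) in RESULTS_GHZ.items():
        print(f"{name}: {relative_uncertainty_percent(value, sigma):.1f}%")
```

The first two entries reproduce the 2 per cent and 11 per cent levels stated in the text; fixing the fine-structure splitting to its theoretical value tightens the Lamb-shift constraint to a few per cent.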

When considering the first measurements on an exotic system such
as antihydrogen, it is necessary to adopt a framework within which it is
possible to compare the results to the expectations of well established
models for normal matter. The choice of which effects can be assumed
to be true in interpreting the data is, of necessity, somewhat arbitrary.
The approach illustrated here is based on the order of perturbation
in the coupling constant α; we have assumed (lower-order) tree-level
effects in order to extract (higher-order) renormalizable loop effects.
Fig. 4 | Comparison of antihydrogen and hydrogen transition frequencies.
The experimentally measured frequencies for the 1S–2P transitions in
antihydrogen, f_res(exp), are compared with those theoretically expected for
hydrogen, f_res(th) (Table 2). All four measurements are consistent with
hydrogen, and their average gives a combined test of CPT invariance at 16 parts
per billion (ppb). The error bars are 1σ, and the calculation of the error bar for
the average takes into account correlated uncertainties (Methods).


Other approaches are possible in interpreting our data. We note that
if the standard theory for the hydrogen atom applies to antihydrogen,
most of the expected QED effect is on the 2S level, rather than on the
2P level. Furthermore, the 1S level receives approximately n³ = 8 times
larger QED corrections than the 2S level; hence, our earlier accurate
determination of the antihydrogen 1S–2S level difference gives strong
constraints on new interactions within the QED framework. However, it
is possible that a new effect could show up in the antihydrogen classic
Lamb shift while satisfying the 1S–2S constraint. See ref. 8 for an example
in a Lorentz-violating effective-field theory framework.

We have investigated the fine structure of the antihydrogen atom
in the n = 2 states. The splitting between the 2P_c and 2P_f states, two of
the 2P Zeeman sublevels belonging to the J = 3/2 and J = 1/2 manifolds
(J, total angular momentum), has been observed in a magnetic field
of 1 T. The energy levels of the 1S–2P transitions agree with the Dirac
theory predictions for hydrogen at 1 T to 16 parts per billion, and their
difference to 0.5%. By assuming the standard Zeeman and hyperfine
effects, and by combining our results with the earlier result of 1S–2S
spectroscopy, we have inferred the zero-field fine-structure splitting
and the classic Lamb shift in the n = 2 level.

These observations expand the horizons of antihydrogen studies, 
providing opportunities for precision measurements of the fine struc- 
ture and the Lamb shift—both of which are longstanding goals in the 
field. Prospects exist for considerable improvements in the precision 
beyond this initial determination. With the advent of the ELENA ring 
in 2021, an upgrade to the Antiproton Decelerator with an anticipated 
increase in the antiproton flux, the statistical uncertainties are expected
to be dramatically reduced. The development of laser cooling would
reduce the Doppler width to a level comparable to the natural linewidth, 
which in turn would improve the precision of the frequency determi- 
nation. It would also permit direct experimental determination of the 
hyperfine splitting in the 2P states, for which theoretical values were 
assumed in this study. 

Such measurements will provide tests of CPT invariance that are
complementary to other precision measurements in antihydrogen, such as
the 1S–2S frequency and the ground-state hyperfine splitting.
Furthermore, a precise value of the classic Lamb shift, combined with that of the
1S–2S interval, will permit an antimatter-only determination of the
antiproton charge radius, without referring to matter measurements, that
is, independent of the proton charge radius puzzle. These examples
signify the importance of broad and complementary measurements in
testing fundamental symmetries. In the absence of compelling theoretical
arguments to guide the way to possible asymmetries, it is essential to
address the antihydrogen spectrum as comprehensively as is practical.
Finally, the results reported here demonstrate our capability to precisely
and reproducibly drive vacuum ultraviolet transitions on a few
anti-atoms, and indicate our readiness for laser cooling of antihydrogen, an
eagerly anticipated development in antimatter studies with far-reaching
implications for both spectroscopic and gravitational studies.




Article 


Online content 


Any methods, additional references, Nature Research reporting sum- 
maries, source data, extended data, supplementary information, 
acknowledgements, peer review information; details of author con- 
tributions and competing interests; and statements of data and code 
availability are available at https://doi.org/10.1038/s41586-020-2006-5. 


1. Lamb, W. E. Jr & Retherford, R. C. Fine structure of the hydrogen atom by a microwave
method. Phys. Rev. 72, 241-243 (1947).

2. Tomonaga, S. On a relativistically invariant formulation of the quantum theory of wave
fields. Prog. Theor. Phys. 1, 27-42 (1946).

3. Schwinger, J. On quantum-electrodynamics and the magnetic moment of the electron. 
Phys. Rev. 73, 416-417 (1948). 

4. Feynman, R. P. Space-time approach to quantum electrodynamics. Phys. Rev. 76, 
769-789 (1949). 

5. Schweber, S. S. QED and the Men who Made it: Dyson, Feynman, Schwinger, and
Tomonaga (Princeton Univ. Press, 1994). 

6. Ahmadi, M. et al. Observation of the 1S-2P Lyman-α transition in antihydrogen. Nature
561, 211-215 (2018). 

7. Ahmadi, M. et al. Characterization of the 1S-2S transition in antihydrogen. Nature 557, 
71-75 (2018). 

8. Kostelecký, V. A. & Vargas, A. J. Lorentz and CPT tests with hydrogen, antihydrogen, and
related systems. Phys. Rev. D 92, 056002 (2015). 

9. Crivelli, P., Cooke, D. & Heiss, M. W. Antiproton charge radius. Phys. Rev. D 94, 052008 
(2016). 

10. Eriksson, S. Precision measurements on trapped antihydrogen in the ALPHA experiment. 
Philos. Trans. Royal Soc. A 376, 20170268 (2018).

11. Dirac, P. A. M. The quantum theory of the electron. Proc. R. Soc. A 117, 610-624 (1928). 

12. Kinoshita, T. Quantum Electrodynamics (World Scientific, 1990). 

13. Karshenboim, S. G. Precision physics of simple atoms: QED tests, nuclear structure and 
fundamental constants. Phys. Rep. 422, 1-63 (2005). 

14. Brodsky, S. J. & Parsons, R. G. Precise theory of the Zeeman spectrum for atomic 
hydrogen and deuterium and the Lamb shift. Phys. Rev. 163, 134-146 (1967). 

15. Ahmadi, M. et al. Antihydrogen accumulation for fundamental symmetry tests. Nat. 
Commun. 8, 681 (2017). 

16. Capra, A. & ALPHA Collaboration. Lifetime of magnetically trapped antihydrogen in 
ALPHA. Hyperfine Interact. 240, 9 (2019). 

17. Ahmadi, M. et al. Observation of the hyperfine spectrum of antihydrogen. Nature 548, 
66-69 (2017); erratum 553, 530 (2018). 

18. Michan, J. M., Polovy, G., Madison, K. W., Fujiwara, M. C. & Momose, T. Narrowband solid 
state VUV coherent source for laser cooling of antihydrogen. Hyperfine Interact. 235, 
29-36 (2015). 

19. Andresen, G. B. et al. Trapped antihydrogen. Nature 468, 673-676 (2010). 

20. Andresen, G. B. et al. Confinement of antihydrogen for 1,000 seconds. Nat. Phys. 7, 
558-564 (2011). 

21. Ahmadi, M. et al. Observation of the 1S-2S transition in trapped antihydrogen. Nature 541, 
506-510 (2017). 

22. Ahmadi, M. et al. Enhanced control and reproducibility of non-neutral plasmas. Phys. Rev. 
Lett. 120, 025001 (2018). 

23. Maury, S. The antiproton decelerator: AD. Hyperfine Interact. 109, 43-52 (1997). 

24. Murphy, T. J. & Surko, C. M. Positron trapping in an electrostatic well by inelastic collisions 
with nitrogen molecules. Phys. Rev. A 46, 5696-5705 (1992). 

25. Surko, C. M., Greaves, R. G. & Charlton, M. Stored positrons for antihydrogen production. 
Hyperfine Interact. 109, 181-188 (1997). 

26. Luiten, O. J. et al. Lyman-α spectroscopy of magnetically trapped atomic hydrogen. Phys.
Rev. Lett. 70, 544-547 (1993).

27. Eikema, K. S. E., Walz, J. & Hänsch, T. W. Continuous coherent Lyman-α excitation of
atomic hydrogen. Phys. Rev. Lett. 86, 5679-5682 (2001).

28. Gabrielse, G. et al. Lyman-α source for laser cooling antihydrogen. Opt. Lett. 43,
2905-2908 (2018).




29. Stracka, S. Real-time detection of antihydrogen annihilations and applications to 
spectroscopy. EPJ Web Conf. 71, 00126 (2014). 

30. Donnan, P. H., Fujiwara, M. C. & Robicheaux, F. A proposal for laser cooling antihydrogen 
atoms. J. Phys. B 46, 025302 (2013). 

31. Pohl, R. et al. The size of the proton. Nature 466, 213-216 (2010). 

32. Beyer, A. et al. The Rydberg constant and proton size from atomic hydrogen. Science 
358, 79-85 (2017). 

33. Fleurbaey, H. et al. New measurement of the transition frequency of hydrogen: 
contribution to the proton charge radius puzzle. Phys. Rev. Lett. 120, 183001 (2018). 

34. The ALPHA Collaboration & Charman, A. E. Description and first application of a new 
technique to measure the gravitational mass of antihydrogen. Nat. Commun. 4, 1785 (2013). 


Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in 
published maps and institutional affiliations. 


Open Access This article is licensed under a Creative Commons Attribution
4.0 International License, which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the 
original author(s) and the source, provide a link to the Creative Commons license, and indicate 
if changes were made. The images or other third party material in this article are included in 
the article’s Creative Commons license, unless indicated otherwise in a credit line to the 
material. If material is not included in the article's Creative Commons license and your 
intended use is not permitted by statutory regulation or exceeds the permitted use, you will 
need to obtain permission directly from the copyright holder. To view a copy of this license, 
visit http://creativecommons.org/licenses/by/4.0/. 


© The Author(s) 2020 


The ALPHA Collaboration 


M. Ahmadi, B. X. R. Alves, C. J. Baker, W. Bertsche, A. Capra, C. Carruth, C. L. Cesar,
M. Charlton, S. Cohen, R. Collister, S. Eriksson, A. Evans, N. Evetts, J. Fajans,
T. Friesen, M. C. Fujiwara, D. R. Gill, P. Granum, J. S. Hangst, W. N. Hardy,
M. E. Hayden, E. D. Hunter, C. A. Isaac, M. A. Johnson, J. M. Jones, S. A. Jones,
S. Jonsell, A. Khramov, P. Knapp, L. Kurchaninov, N. Madsen, D. Maxwell,
J. T. K. McKenna, S. Menary, J. M. Michan, T. Momose, J. J. Munich, K. Olchanski,
A. Olin, P. Pusa, C. Ø. Rasmussen, F. Robicheaux, R. L. Sacramento, M. Sameed,
E. Sarid, D. M. Silveira, C. So, D. M. Starko, G. Stutter, T. D. Tharp,
R. I. Thompson, D. P. van der Werf & J. S. Wurtele


1. Department of Physics, University of Liverpool, Liverpool, UK.
2. Department of Physics and Astronomy, Aarhus University, Aarhus, Denmark.
3. Department of Physics, College of Science, Swansea University, Swansea, UK.
4. School of Physics and Astronomy, University of Manchester, Manchester, UK.
5. Cockcroft Institute, Warrington, UK.
6. TRIUMF, Vancouver, British Columbia, Canada.
7. Department of Physics, University of California at Berkeley, Berkeley, CA, USA.
8. Instituto de Fisica, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil.
9. Department of Physics, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
10. Department of Physics and Astronomy, University of Calgary, Calgary, Alberta, Canada.
11. Department of Physics and Astronomy, University of British Columbia, Vancouver, British Columbia, Canada.
12. Department of Physics, Simon Fraser University, Burnaby, British Columbia, Canada.
13. Department of Physics, Stockholm University, Stockholm, Sweden.
14. Department of Physics and Astronomy, York University, Toronto, Ontario, Canada.
15. Ecole Polytechnique Fédérale de Lausanne (EPFL), Swiss Plasma Center (SPC), Lausanne, Switzerland.
16. Department of Chemistry, University of British Columbia, Vancouver, British Columbia, Canada.
17. Department of Physics and Astronomy, University of Victoria, Victoria, British Columbia, Canada.
18. Department of Physics and Astronomy, Purdue University, West Lafayette, IN, USA.
19. Soreq NRC, Yavne, Israel.
20. Physics Department, Marquette University, Milwaukee, WI, USA.
21. IRFU, CEA/Saclay, Gif-sur-Yvette, France.
e-mail: Makoto.Fujiwara@triumf.ca; jeffrey.hangst@cern.ch; momose@chem.ubc.ca


Methods 


Transition-frequency determination 

The observed 1S-2P transition spectra have asymmetric shapes with 
a low-frequency tail caused by Zeeman shifts in the inhomogeneous- 
magnetic-field regions away from the centre of the ALPHA-2 trap. As 
a result, the apparent peak of the observed spectrum is shifted to a 
slightly lower frequency with respect to the resonance transition 
frequency $f_{\mathrm{res}}$, which is defined for atoms in resonance at the 
magnetic-field minimum of the trap. This offset is relatively small (of the order of 
50 MHz). Nonetheless, we performed extensive analysis to understand 
the effects of this asymmetry on our transition-frequency determina- 
tion. The details of the analysis follow. 

A detailed simulation was used to model the motion of trapped 
antihydrogen atoms in the ALPHA-2 trap, as well as their interaction 
with pulsed laser radiation. Aspects of our simulation have been 
validated in previous studies. To determine 
the resonance transition frequency, we first simulated lineshapes for 
the transitions from the two trappable 1S hyperfine states to the $2P_{c}$ 
and $2P_{f}$ excited states (that is, for four transitions: $1S_{c} \to 2P_{c+}$, 
$1S_{c} \to 2P_{f+}$, $1S_{d} \to 2P_{c-}$ and $1S_{d} \to 2P_{f-}$). We then fitted each 
component with an asymmetric lineshape function, referred to as GE. GE is a 
Gaussian spliced to an exponential low-frequency tail, where the derivative at 
the crossover point is required to be continuous. GE has four parameters: the peak 
frequency ($f_{\mathrm{peak}}$) and the width ($W$) of the Gaussian, the crossover-point 
frequency ($f_{0}$) and the overall amplitude ($A$). From the fit, we determined 
the simulated lineshape parameters $f_{\mathrm{peak}}(\mathrm{sim})$, $W(\mathrm{sim})$, $f_{0}(\mathrm{sim})$ and 
$A(\mathrm{sim})$ for each transition. In addition, we derived the peak frequency 
offset $\Delta f$, defined as $\Delta f = f_{\mathrm{peak}}(\mathrm{sim}) - f_{\mathrm{res}}(\mathrm{th})$, where $f_{\mathrm{res}}(\mathrm{th})$ is the expected 
theoretical resonance frequency for hydrogen in the magnetic field B. 
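
As a rough illustration of the GE function described above, the following Python sketch splices a Gaussian to an exponential low-frequency tail so that both the value and the first derivative are continuous at the crossover point. The function name `ge_lineshape`, the treatment of the width as the Gaussian standard deviation, and the numerical values in the usage lines are assumptions made for illustration; they are not taken from the analysis code used for this work.

```python
import numpy as np

def ge_lineshape(f, f_peak, width, f_cross, amplitude):
    """Gaussian spliced to an exponential low-frequency tail (illustrative sketch).

    Assumptions (not from the source): `width` is the Gaussian standard
    deviation, and the crossover lies below the peak (f_cross < f_peak).
    """
    f = np.asarray(f, dtype=float)
    gauss = amplitude * np.exp(-0.5 * ((f - f_peak) / width) ** 2)
    # Value and logarithmic slope of the Gaussian at the crossover point.
    value_at_cross = amplitude * np.exp(-0.5 * ((f_cross - f_peak) / width) ** 2)
    log_slope = (f_peak - f_cross) / width ** 2
    # Exponential tail sharing that value and slope, so the spliced curve
    # and its first derivative are continuous at f_cross.
    tail = value_at_cross * np.exp(log_slope * np.minimum(f - f_cross, 0.0))
    return np.where(f < f_cross, tail, gauss)

# Hypothetical parameters (MHz, relative to an arbitrary origin).
detuning = np.linspace(-1500.0, 1000.0, 11)
profile = ge_lineshape(detuning, f_peak=0.0, width=300.0, f_cross=-200.0, amplitude=1.0)
```

Matching the logarithmic slope at the crossover is what fixes the decay constant of the tail, so the four parameters listed in the text fully determine the curve in this sketch.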

The experimentally observed spectra were then fitted with GE lineshapes. 
A sum of two GEs was used to fit singly spin-polarized samples, 
where only $f_{\mathrm{res}}$ and a single normalization factor were used as the 
fitting parameters, whereas the rest of the parameters (that is, the $W$ 
and $f_{0}$ values of each GE, the spacing of $f_{\mathrm{res}}$ between the two GEs, and the 
ratio of the amplitudes $A$ of the two GEs) were fixed to the corresponding 
simulated values. For doubly spin-polarized samples, the experimental 
spectra were fitted with a single GE lineshape. In these fits, $W$ and $f_{0}$ were 
fixed using a fit to the simulated spectrum in which an estimated 5% 
contamination of the $1S_{c}$ component was assumed. The experimental 
transition frequency is given by $f_{\mathrm{res}}(\mathrm{exp}) = f_{\mathrm{peak}}(\mathrm{exp}) - \Delta f$, where $f_{\mathrm{peak}}(\mathrm{exp})$ 
is the peak frequency of the experimental data obtained by the fit. Here 
$\Delta f$ corrects for the asymmetric lineshape as described earlier. The red 
lines (labelled as ‘Model 1’) in Fig. 3 show the results of these fits using 
standard simulations. We note that the transition to the $2P_{e}$ state is 
allowed when the laser polarization is not perfectly perpendicular to the 
B field. This could arise from the slight angle between the laser and the 
magnetic field (maximum 4° at the edge of our trap) or from a possible 
nonlinear component in the polarization of the 121-nm light (expected 
to be of the order of 10% or less). The frequency of the $1S$–$2P_{e}$ transition 
is well separated from that of the $1S$–$2P_{c}$ transition (by about −3.5 GHz), 
and its predicted intensity is very small (less than a few per cent of that 
for the $1S$–$2P_{c}$ transition); hence, it was ignored in the analysis. 


Transition-frequency uncertainties 
Extensive studies were performed to quantify the uncertainties in 
our frequency determination. The standard simulated spectra repro- 
duce the observed lineshape reasonably well without any fine-tuning 
(Extended Data Fig. 1). The sensitivity of the obtained resonance 
frequency $f_{\mathrm{res}}(\mathrm{exp})$ to the input parameters in the simulation was studied 
by varying these input parameters and repeating the same analysis. 
The standard inputs to the simulation, and the ranges of the parameters 
studied (given in parentheses), were as follows. Laser pulse energy: 
500 pJ (350 pJ, 800 pJ); laser linewidth: 65 MHz (50 MHz, 80 MHz); 
relative magnitude of the laser sideband (present at +90 MHz with 
respect to the main band owing to multimode lasing in the 730-nm 
amplification cavity): 10% (0%, 25%); radial displacement of the 
laser beam: 0 mm (0 mm, 3 mm); initial quantum state of antihydrogen 
at formation: n = 30 (1, 30); initial diameter of the antihydrogen 
cloud: 0.45 mm (0.45 mm, 0.90 mm); temperature of antihydrogen at 
formation (before trapping): 15 K (1 K, 15 K). 

An alternative fitting method was also used to study the robustness 
of our procedure. Here, the lineshape function GE was fitted to the data 
without using constraints from the simulated spectrum. From the fit, 
$f_{\mathrm{peak}}(\mathrm{exp})$ was extracted for each transition, and the experimental 
resonance frequency was determined as $f_{\mathrm{res}}(\mathrm{exp}) = f_{\mathrm{peak}}(\mathrm{exp}) - \Delta f$, where the 
offset $\Delta f$ from the standard simulation was assumed. The lineshapes 
given by these fits are shown by blue lines (labelled as ‘Model 2’) in Fig. 3. 

The results of the analyses using the simulations with varied input 
parameters, as well as alternative fitting models, are given by red lines 
in Extended Data Fig. 1, which illustrates that the dependence on the 
details of the fitting procedure is small. The variations of the extracted 
frequency $f_{\mathrm{res}}(\mathrm{exp})$ in these studies (both with different simulation 
inputs and different fitting methods) were generally within the 
statistical uncertainties of these fits. We took the largest deviations in $f_{\mathrm{res}}(\mathrm{exp})$ 
among these studies as a measure of the fitting-model dependence 
(Table 3). 

It should be noted that our evaluation of the fitting-model depend- 
ence systematics relies on the GE model being a reasonable representa- 
tion of the simulated data. This agreement is qualitatively illustrated 
in Extended Data Fig. 1. Quantitatively, for the simulations with the 
standard input parameters, the $\chi^{2}$ per degree of freedom (DOF) ranges 
from 1.2 to 2.5 (with an average of 1.8), where DOF = 8. When the input 
parameters are varied in the fits to the data, the $\chi^{2}$ per DOF ranges from 
1.0 to 3.9, with an average of 2.1. The simulation statistics were roughly 
a factor of 2-4 greater than the data; hence, the uncertainties arising 
from our analytical model of the simulation lineshape are small. 

The sources of uncertainty in the transition frequencies can be sum- 
marized as follows (we note that the frequency uncertainties at 
730 nm should be multiplied by a factor of 6 to give those at 121 nm): 
(a) Wavemeter drift: this is due to temperature-induced drift of the 
wavemeter readings, which was estimated from offline studies to be 
about 20 MHz K⁻¹ at 730 nm. Given the recorded temperature variation 
of ±0.25 K, we assigned an error of ±5 MHz at 730 nm. We note that a 
temperature drift during our 2-h measurements would result in a 
broadening of the observed linewidth. This effect would also be taken into 
account partly by the fitting-model uncertainty discussed above. 
Therefore, there is a possibility of partial double counting, but we 
conservatively list both effects separately. (b) Wavemeter offset: an offset 
of the He–Ne laser calibration source, estimated to be ±3 MHz at 730 nm 
by offline calibration. (c) 730-nm cavity resonance-frequency 
correction: the frequency of the generated 730-nm pulse was measured to 
be shifted from that of the continuous-wave 730-nm seed laser. This 
shift of about 10 MHz at 730 nm was regularly monitored, and was 
corrected for in our frequency determination. We conservatively assign 
an error of $10/\sqrt{12} \approx 3$ MHz to this correction at 730 nm (the standard 
deviation of a uniform distribution with a width of 10 MHz). (d) Residual 
$1S_{c}$ state contamination: our earlier studies with shorter running 
times indicate there is a residual population of the order of 5% of 
the $1S_{c}$ state after the microwave-driven clearing procedure, which was 
corrected for in the analysis above. We estimate the error in this 
correction by analysing the data assuming no residual $1S_{c}$ population. We 
take 68% of the differences between the two analysis results (33.5 MHz 
and 24 MHz for the $2P_{c}$ and $2P_{f}$ transitions, respectively) as 1σ 
uncertainties in the correction. (e) Magnetic field: the field at the magnetic 
minimum of the ALPHA-2 trap was measured in situ using the electron 
cyclotron resonance (ECR) method. A conservative uncertainty of 
10 MHz in the ECR measurement gives a B-field error of 3.6 × 10⁻⁴ T, 
which in turn gives frequency errors of 5 MHz and 8 MHz for the $1S$–$2P_{c}$ 
and $1S$–$2P_{f}$ transitions, respectively, at 1 T. We take these values as a 
measure of the uncertainty due both to the absolute value and to the 
run-to-run stability of the B field. We note that the frequency 
uncertainty in the 1S–2S transition due to B-field variations is negligible for 
our purposes. (f) Statistical uncertainties of the fit: these represent 
statistical uncertainties in the fit both from the experimental data and 
from the simulations. (g) Model uncertainties: described above. 

The total errors for each transition are given by the quadratic sum of 
errors (a)-(g). Care must be taken when taking an average or a differ- 
ence of the transition frequencies. Here we assume that error (b), the 
wavemeter offset, introduces a common offset to all the data series. 
The other errors are assumed to be uncorrelated across the dataset. 
The resulting combined uncertainty for the transition frequencies of 
antihydrogen is 39 MHz or 16 ppb (Fig. 4, average value). We expect 
that virtually all of the uncertainties can be considerably reduced in 
the near future owing to increased statistics and improved control of 
the systematics. 
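
The arithmetic behind several of the numbers quoted in this section can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, the free-electron cyclotron frequency per tesla is a standard value rather than one quoted in the text, and the frequency used for the ppb conversion is the hydrogen 1S–2P(1/2) centroid given later in these Methods.

```python
import math

def uniform_sigma(full_width):
    """Standard deviation of a uniform distribution of the given full width."""
    return full_width / math.sqrt(12.0)

def quadrature_sum(*errors):
    """Combine uncorrelated errors in quadrature."""
    return math.sqrt(sum(e * e for e in errors))

# (a) Wavemeter drift: 20 MHz/K x 0.25 K, quoted at 730 nm.
drift_730 = 20.0 * 0.25                      # 5 MHz
# (c) Cavity correction: uniform distribution of width 10 MHz at 730 nm.
cavity_730 = uniform_sigma(10.0)             # about 2.9 MHz
# Errors at 730 nm are multiplied by 6 to obtain those at 121 nm.
drift_121, cavity_121 = 6 * drift_730, 6 * cavity_730
# (e) ECR: a 10 MHz cyclotron-frequency error expressed as a field error.
CYCLOTRON_MHZ_PER_T = 27992.49               # e/(2*pi*m_e), standard value (assumption)
field_error_T = 10.0 / CYCLOTRON_MHZ_PER_T   # about 3.6e-4 T
# Combined 39 MHz uncertainty relative to the ~2.466e9 MHz transition frequency.
relative_ppb = 39.0 / 2_466_060_355.0 * 1e9  # about 16 ppb
# Quadrature combination of two of the terms only, as an illustration;
# the full sum over (a)-(g) needs the tabulated values, which are not reproduced here.
partial_121 = quadrature_sum(drift_121, cavity_121)
```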


Determination of the fine-structure splitting and the Lamb shift 
of antihydrogen 

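To analyse the Zeeman-shifted energy levels of antihydrogen in the 2P 
state, we used the following Hamiltonian for the 2P state, which includes 
the field-free Hamiltonian ($\hat{H}_{0}$), the fine-structure Hamiltonian ($\hat{H}_{\mathrm{fs}}$), 
the Zeeman Hamiltonian ($\hat{H}_{Z}$) and the hyperfine-structure Hamiltonian 
($\hat{H}_{\mathrm{hf}}$): 

$$\hat{H} = \hat{H}_{0} + \hat{H}_{\mathrm{fs}} + \hat{H}_{Z} + \hat{H}_{\mathrm{hf}} \tag{1}$$

$$\hat{H}_{\mathrm{fs}} = \frac{2}{3}\,\mathcal{E}_{\mathrm{fs}}\left(\frac{1}{\hbar^{2}}\,\mathbf{L}_{e}\cdot\mathbf{S}_{e} + 1\right) \tag{2}$$

$$\hat{H}_{Z} = -\frac{2\mu_{e}(2P)}{\hbar}\,\mathbf{S}_{e}\cdot\mathbf{B} - \frac{2\mu_{p}}{\hbar}\,\mathbf{I}_{p}\cdot\mathbf{B} - \frac{\mu_{L}}{\hbar}\,\mathbf{L}_{e}\cdot\mathbf{B} \tag{3}$$

$$\hat{H}_{\mathrm{hf}} = \frac{C_{1}}{\hbar^{2}}\,\mathbf{L}_{e}\cdot\mathbf{I}_{p} + \frac{C_{2}}{\hbar^{2}}\left[\mathbf{I}_{p}\cdot\mathbf{S}_{e} - 3\,(\mathbf{I}_{p}\cdot\hat{\mathbf{r}})(\mathbf{S}_{e}\cdot\hat{\mathbf{r}})\right] \tag{4}$$

Here, $\mathbf{L}_{e}$ is the orbital angular momentum of the positron, $\mathbf{S}_{e}$ is the spin 
angular momentum of the positron, $\mathbf{I}_{p}$ is the nuclear spin angular 
momentum of the antiproton and $\mathbf{r}$ is the position vector of the 
positron. $\mathcal{E}_{\mathrm{fs}}$ is the fine-structure splitting of antihydrogen at zero field. The 
magnetic moments of the positron, $\mu_{e}(2P)$, and of the antiproton, $\mu_{p}$, are 
expressed in terms of $g_{e}$ and $g_{p}$, the positron spin and antiproton g-factors, 
respectively; $e$ and $m_{e}$ are the charge and mass of the positron, and $\alpha$ is the 
fine-structure constant. The last term of equation (3) is the Zeeman 
interaction due to the orbital angular momentum of the positron, with a 
magnetic moment of $\mu_{L} = -\left(1 - \frac{m_{e}}{m_{p}}\right)\frac{|e|\hbar}{2m_{e}}$, where $m_{p}$ is the mass of the 
antiproton. $C_{1}$ is the hyperfine-coupling constant due to the antiproton 
spin and the orbital angular momentum of the positron, and $C_{2}$ is the 
hyperfine interaction due to the magnetic dipole-dipole interaction. 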

For the analysis of the classic Lamb shift ($\mathcal{E}_{\mathrm{Lamb}}$) and the fine-structure 
($\mathcal{E}_{\mathrm{fs}}$) parameters of antihydrogen, we assumed that the absolute values 
of the three magnetic moments ($\mu_{e}$, $\mu_{p}$ and $\mu_{L}$) are the same as those 
of hydrogen. Previous measurements of the basic properties of 
antiparticles are consistent with this assumption. The hyperfine-coupling 
constants are also assumed to be those of hydrogen, $C_{1} = 22.2$ MHz 
and $C_{2} = -22.2$ MHz. 

Our measurements determine the energy levels, with respect to the 
1S ground state, of two of the Zeeman sublevels in the n = 2 positronic 
manifold of antihydrogen at a magnetic field of 1.0329 T. Specifically, 
the $2P_{f}$ state belongs to the $2P_{1/2}$ manifold, and the $2P_{c}$ state belongs to 
the $2P_{3/2}$ manifold (see Fig. 1). We combine these results with our 
previous measurement of the $1S_{d}$–$2S_{d}$ transition and assume the validity of 
the standard Zeeman and hyperfine interactions to derive the 
fine-structure splitting $\mathcal{E}_{\mathrm{fs}}$ (that is, the energy difference between $2P_{1/2}$ and 
$2P_{3/2}$), and the classic Lamb shift $\mathcal{E}_{\mathrm{Lamb}}$ (that is, the energy difference 
between $2S_{1/2}$ and $2P_{1/2}$), both defined at zero field. 
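
To see how a fine-structure term and a Zeeman term together produce the 2P level pattern used here, one can diagonalize them numerically in the six-dimensional 2P space (orbital l = 1, spin 1/2). The sketch below is not the analysis used for this work: it neglects the hyperfine term as well as reduced-mass and bound-state corrections to the magnetic moments, and it uses hydrogen sign conventions. The constants (Bohr magneton, free-electron g-factor, and a zero-field splitting taken as the difference of the two hydrogen centroid frequencies quoted later in these Methods) and all function names are assumptions made for illustration.

```python
import numpy as np

MU_B_MHZ_PER_T = 13996.245   # Bohr magneton / h, standard value (assumption)
G_SPIN = 2.002319            # free-electron spin g-factor (assumption)
E_FS_MHZ = 2_466_071_324 - 2_466_060_355   # zero-field 2P splitting, 10,969 MHz

def angular_momentum(j):
    """Return (Jx, Jy, Jz) in units of hbar, basis ordered m = j, ..., -j."""
    m = np.arange(j, -j - 1, -1)
    raising = np.zeros((len(m), len(m)))
    for col in range(1, len(m)):
        raising[col - 1, col] = np.sqrt(j * (j + 1) - m[col] * (m[col] + 1))
    jx = (raising + raising.T) / 2
    jy = (raising - raising.T) / 2j
    return jx, jy, np.diag(m)

def levels_2p(b_tesla):
    """Sorted eigenvalues (MHz) of the fine-structure plus Zeeman terms for 2P."""
    l_ops = angular_momentum(1.0)
    s_ops = angular_momentum(0.5)
    l_dot_s = sum(np.kron(l, s) for l, s in zip(l_ops, s_ops))
    h_fs = (2.0 / 3.0) * E_FS_MHZ * (l_dot_s + np.eye(6))
    lz = np.kron(l_ops[2], np.eye(2))
    sz = np.kron(np.eye(3), s_ops[2])
    h_zeeman = MU_B_MHZ_PER_T * b_tesla * (lz + G_SPIN * sz)
    return np.linalg.eigvalsh(h_fs + h_zeeman)

# Ascending order at 1.0329 T in this simplified model: d, f, e, c, b, a.
e_d, e_f, e_e, e_c, e_b, e_a = levels_2p(1.0329)
split_c_f = e_c - e_f   # to be compared with the measured 2P_c - 2P_f separation
split_e_c = e_e - e_c   # offset of the 2P_e level relative to 2P_c
```

In this simplified model the c–f separation comes out close to 15 GHz and the e level lies roughly 3.5 GHz below c, which is the scale of the separations discussed in the text; the residual differences from the measured values are expected given the neglected terms.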

Taking into account the hyperfine splitting, we find the energy 
separation between the $2P_{c-}$ and $2P_{f-}$ levels at 1.0329 T to be 14.945 ± 0.0975 GHz, 
from the difference of the weighted average values of the observed 
transition frequencies. Furthermore, we obtain the separation between the 
$2S_{d}$ and $2P_{c-}$ levels to be $\Delta E(2S, 2P_{c}) = 9{,}832 \pm 49$ MHz, and that between 
the $2S_{d}$ and $2P_{f-}$ levels to be $\Delta E(2S, 2P_{f}) = 24{,}778 \pm 60$ MHz, in the same 
field. The sum and the difference of the two quantities, $\Delta E(2S, 2P_{c})$ and 
$\Delta E(2S, 2P_{f})$, can be expressed by the following equations, which are 
based on the standard Hamiltonian of the hydrogen atom in a magnetic 
field B. We neglect terms that contribute less than 1 MHz. 


AE(2S, 2P;) - AE(2S, 2P,) = 


3 ss 5 
2E(— B) + 5 <ucos(20) + pleos(20) - 3/2 sin(20)]Gs (S) 
AE(2S, 2Pr) + AE(2S, 2P.) = 
13 3, 1 : (6) 
Here, $\mathcal{E}_{\mathrm{hf}}(2S)$ is the hyperfine splitting in the 2S state at zero field. 

Finally, using the CODATA 2014 values of the fundamental constants 
for the hydrogen atom, the fine-structure splitting $\mathcal{E}_{\mathrm{fs}}$ and the classic 
Lamb shift $\mathcal{E}_{\mathrm{Lamb}}$ of the antihydrogen atom are determined by 
numerically solving equations (5) and (6) with the measured energy-level 
differences given in Table 2 as input. 


Hydrogen transition frequencies in a magnetic field 

From zero-field measurements in hydrogen for the $1S_{1/2}$–$2S_{1/2}$, 
$2S_{1/2}$–$2P_{1/2}$ and $2P_{1/2}$–$2P_{3/2}$ transitions, we obtain hyperfine 
centroid frequencies of 


$1S$–$2P_{1/2}$ transition: 2,466,060,355 MHz 
$1S$–$2P_{3/2}$ transition: 2,466,071,324 MHz 


The transition frequencies at 1.0329 T (Table 2) are calculated by 
evaluating corrections assuming the standard Zeeman, fine-structure 
and hyperfine interactions in a magnetic field and using the current 
CODATA values of the fundamental constants. The precision of our 
calculations is better than 1 MHz. 

In comparing the hydrogen values with the measured antihydrogen 
frequencies in Table 2 and Fig. 4, the value of the magnetic field was 
assumed to be exact for the hydrogen case. 


Data availability 


The datasets generated and/or analysed during the current study are 
available from J.S.H. on reasonable request. 

Topological physics relies on the structure of the eigenstates of the Hamiltonians. 
The geometry of the eigenstates is encoded in the quantum geometric tensor, 
comprising the Berry curvature (crucial for topological matter) and the quantum 
metric, which defines the distance between the eigenstates. Knowledge of the 
quantum metric is essential for understanding many phenomena, such as 
superfluidity in flat bands, orbital magnetic susceptibility, the exciton Lamb shift 
and the non-adiabatic anomalous Hall effect. However, the quantum geometry of 
energy bands has not been measured. Here we report the direct measurement of 
both the Berry curvature and the quantum metric in a two-dimensional continuous 
medium (a high-finesse planar microcavity), together with the related anomalous 
Hall drift. The microcavity hosts strongly coupled exciton–photon modes (exciton 
polaritons) that are subject to photonic spin–orbit coupling, from which Dirac cones 
emerge, and to exciton Zeeman splitting, breaking time-reversal symmetry. The 
monopolar and half-skyrmion pseudospin textures are measured using 
polarization-resolved photoluminescence. The associated quantum geometry of the bands is 
extracted, enabling prediction of the anomalous Hall drift, which we measure 
independently using high-resolution spatially resolved epifluorescence. Our results 
unveil the intrinsic chirality of photonic modes, the cornerstone of topological 
photonics. These results also experimentally validate the semiclassical description 
of wavepacket motion in geometrically non-trivial bands. The use of exciton 
polaritons (interacting photons) opens up possibilities for future studies of 
quantum fluid physics in topological systems. 


One of the key manifestations of topological effects in physics is the 
conductance quantization in the two-dimensional (2D) quantum Hall 
effect. This perfect quantization relies on a topological invariant 
characterizing the global band properties: the Chern number. Non-zero 
Chern numbers are associated with the chiral conducting edge states 
in topological insulators and superconductors. Beyond electronic 
systems, topological band concepts have been extended to various 
wave systems covering photonics, acoustics, cold atoms and 
even geophysics. 

Topological effects are not encoded in the energy spectrum of the 
system but rely on the non-trivial geometry of the eigenstates. It is 
the gauge-invariant quantum geometric tensor (QGT) that contains 
the structural information about the eigenstates of a parametrized 
Hamiltonian. The QGT has a symmetric real part which defines the 
quantum metric characterizing distances between states in a 
parameter space. Its antisymmetric imaginary part determines the Berry 
curvature, whose momentum-space distribution is crucial in modern 
physics. Locally, the Berry curvature is responsible for the 
anomalous Hall transport in the intrinsic spin Hall and valley Hall effects. 
Its integral over a 2D closed manifold gives the Chern number. 
Additionally, the quantum metric, related to the concept of fidelity in quantum 
information theory, also describes important physical phenomena. 
It can probe quantum phase transitions when defined in an arbitrary 
parameter space. The momentum-space metric affects the electronic 
orbital magnetic susceptibility in crystals and the exciton Lamb shift 
in transition-metal dichalcogenides. It characterizes superfluidity 
and the current of bogolons (superfluid excitations) in flat bands 
and also corrections to the semiclassical equations describing the 
anomalous Hall effect. 

The extension of topological concepts from solid state physics to 
other classical or quantum systems has opened up possibilities for 
measuring the local geometrical properties of bands, not just the global 
properties (such as the conductivity measured in the quantum Hall 
effect). Several protocols have been proposed to measure the Berry 
curvature. Experimental reconstructions via indirect dynamical 
measurements have been reported. The parameter space geometry 
of two-level systems has been explored experimentally even more 
recently. In this work, we present a direct measurement of the 
full momentum space QGT (Berry curvature and quantum metric) 
of the 2D bands of a homogeneous system. Furthermore, we 
measure independently the dynamics of an accelerated wavepacket which 
demonstrates the anomalous Hall drift. This drift is reproduced by 
the semiclassical equations of motion including the measured band 
geometry as an input parameter. 

Fig. 1 | Emergence of pseudospin monopoles from photoluminescence at 
0 T. a, Pseudospin (S) orientation on the Poincaré sphere, parametrized by $\theta$ 
and $\phi$ and given by the emission polarization degree on the HV, DA and circular 
RL basis; equation (5). b, Eigenmode energies (zero at about 1,601.5 meV) along 
$k_{x}$ and $k_{y}$. Inset, energy splitting of eigenmodes. Points, experiment; pink and 
blue lines, dispersions of the two branches fitted with equation (1). c, d, Degree 

of polarization of the lower mode in k space for HV (c) and DA (d). The crosses 
mark the degeneracy points. e, k-space in-plane pseudospin ($S_{1}$, $S_{2}$) texture. 
f, A magnification of one of the crossing points that demonstrates a monopole 
texture. The lengths of the arrows in e, f are in arbitrary units. g, Quantum 
metric tensor trace ($g_{xx} + g_{yy}$), with peak around the monopoles. 

Our experimental platform is a high-quality planar microcavity 
(quality factor, $Q > 10^{5}$) with embedded quantum wells that support 
2D strongly coupled exciton–photon bands (see Supplementary 
Fig. 6). Each band is doubly polarization degenerate and forms a 
pseudo-spinor. The polarization degeneracy is lifted by the photonic 
splitting between the transverse electric (TE) and transverse magnetic 
(TM) eigenmodes and, under magnetic field, by the exciton Zeeman 
splitting. The polarized polariton eigenstates are exactly determined by 
a Fourier space mapping of polarization-resolved photoluminescence. 
They exhibit non-zero Berry curvature and a non-zero quantum metric 
as discussed below. 


Theoretical background 

Before presenting the measurements, let us recall the properties of 
an effective two-band Hamiltonian describing, in the conservative 
limit, the lower polariton branch in the circular polarization basis: 

$$H_{\mathbf{k}} = \begin{pmatrix} \dfrac{\hbar^{2}k^{2}}{2m^{*}} + \Delta_{z} & \alpha - \beta k^{2} e^{-2i\varphi} \\ \alpha - \beta k^{2} e^{2i\varphi} & \dfrac{\hbar^{2}k^{2}}{2m^{*}} - \Delta_{z} \end{pmatrix} \tag{1}$$

where $m^{*} = m_{l}m_{t}/(m_{l} + m_{t})$ and $m_{l}$ and $m_{t}$ are the longitudinal and 
transverse effective masses; $k = |\mathbf{k}| = \sqrt{k_{x}^{2} + k_{y}^{2}}$ is the in-plane wavevector 
($k_{x} = k\cos\varphi$, $k_{y} = k\sin\varphi$, and $\varphi$ is the propagation angle); $\Delta_{z}$ is the 
polariton Zeeman splitting owing to the excitonic part; and $\alpha$ is the optical 
birefringence, unavoidable in crystalline systems. The birefringence 
leads to a k-independent splitting between horizontally and 
vertically polarized states. $\beta$ quantifies the k-dependent TE–TM splitting, 
ubiquitous in 2D photonic systems, which makes 2D photonic bands 
geometrically non-trivial; and $\hbar$ is the reduced Planck’s constant. This 
2 × 2 Hamiltonian can be decomposed into a linear combination of Pauli 
matrices, describing the interaction between an effective magnetic 
field and a pseudospin: 
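
$$H_{\mathbf{k}} = \frac{\hbar^{2}k^{2}}{2m^{*}}\,\mathbb{1} + \boldsymbol{\Omega}(\mathbf{k})\cdot\boldsymbol{\sigma} \tag{2}$$

where $\boldsymbol{\sigma}$ is a vector of Pauli matrices (the spin operator), the average 
value of which is $\mathbf{S} = \langle\boldsymbol{\sigma}\rangle$. $\mathbf{S}$ is here the polarization pseudospin of light. 
The effective field $\boldsymbol{\Omega}(\mathbf{k})$ reads: 

$$\boldsymbol{\Omega}(\mathbf{k}) = \begin{pmatrix} \alpha - \beta k^{2}\cos 2\varphi \\ -\beta k^{2}\sin 2\varphi \\ \Delta_{z} \end{pmatrix} \tag{3}$$
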

The pseudospins of the eigenstates are parallel and antiparallel with 
the effective field. The QGT components $g_{ij}$ and $B_{z}$ are linked to the 
variation of the pseudospin orientation in k space as: 

$$g_{ij} = \frac{1}{4}\left(\partial_{k_{i}}\theta\,\partial_{k_{j}}\theta + \sin^{2}\theta\,\partial_{k_{i}}\phi\,\partial_{k_{j}}\phi\right) \tag{4a}$$

$$B_{z} = \frac{1}{2}\sin\theta\left(\partial_{k_{x}}\theta\,\partial_{k_{y}}\phi - \partial_{k_{y}}\theta\,\partial_{k_{x}}\phi\right) \tag{4b}$$

with $g_{ij}$ the metric components and $B_{z}$ the Berry curvature in the z 
direction, which cancel if the TE–TM spin–orbit coupling vanishes 
($\beta = 0$). $\theta(\mathbf{k})$ and $\phi(\mathbf{k})$ are, respectively, the polar and azimuthal angles 
parametrizing the eigenstate $\psi = [\cos(\theta/2)e^{-i\phi}, \sin(\theta/2)]^{T}$ (where 
T indicates the transpose) and the pseudospin position on 
the Poincaré sphere (Fig. 1a) with $\theta = \arccos S_{3}$ and $\phi = \arctan(S_{2}/S_{1})$, 
where $S_{1}$, $S_{2}$, $S_{3}$ are the components of the pseudospin 
vector $\mathbf{S}$. The QGT components $g_{ij}$ and $B_{z}$ are computed 
analytically in Supplementary Note 1 using the eigenstates of equation (1) 
with $\mathbf{S} \parallel \boldsymbol{\Omega}$. 

Fig. 2 | Broken time-reversal symmetry: emergence of half-skyrmion 
pseudospin textures, from photoluminescence at 9 T. a, Energy dispersion 
along $k_{x}$ and $k_{y}$ (zero at about 1,601.9 meV). Anticrossing of the branches is 
observed instead of their crossing. The polarization bands are split for all 
wavevectors (see inset, where along $k_{x}$ the splitting has a non-zero minimum). 
The pink and blue lines are the fits of the energy dispersions. b–d, Polarization- 
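
The link between the effective field and the quantum geometric tensor can be explored numerically. The sketch below builds the unit pseudospin vector n = Ω/|Ω| of the branch polarized along the effective field on a k-space grid and evaluates the metric and Berry curvature by finite differences. It uses the unit-vector forms g_ij = (1/4) ∂_i n · ∂_j n and B_z = (1/2) n · (∂_kx n × ∂_ky n), which are algebraically equivalent to the angle-based expressions in terms of the polar and azimuthal angles and avoid the branch cut of the azimuthal angle. The function name, the grid, the dimensionless parameter values and the sign chosen for the second component of the field are assumptions for illustration; the overall sign of B_z flips for the other branch.

```python
import numpy as np

def qgt_on_grid(kx, ky, alpha, beta, delta_z):
    """Finite-difference quantum metric and Berry curvature of one branch.

    Assumes the effective field never vanishes on the grid (for example
    delta_z != 0); otherwise the unit vector is undefined.
    """
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    # k^2 cos(2*phi) = kx^2 - ky^2 and k^2 sin(2*phi) = 2*kx*ky.
    omega = np.stack([
        alpha - beta * (KX**2 - KY**2),
        -beta * 2.0 * KX * KY,
        np.full_like(KX, delta_z),
    ])
    n = omega / np.linalg.norm(omega, axis=0)
    dn_dkx = np.gradient(n, kx, axis=1)
    dn_dky = np.gradient(n, ky, axis=2)
    g_xx = 0.25 * np.sum(dn_dkx * dn_dkx, axis=0)
    g_xy = 0.25 * np.sum(dn_dkx * dn_dky, axis=0)
    g_yy = 0.25 * np.sum(dn_dky * dn_dky, axis=0)
    b_z = 0.5 * np.sum(n * np.cross(dn_dkx, dn_dky, axis=0), axis=0)
    return g_xx, g_xy, g_yy, b_z

# Hypothetical dimensionless parameters on a square grid.
k_axis = np.linspace(-3.0, 3.0, 601)
g_xx, g_xy, g_yy, b_z = qgt_on_grid(k_axis, k_axis, alpha=0.0, beta=1.0, delta_z=1.0)
```

Setting `beta=0.0` in this sketch makes the pseudospin uniform, so all four returned arrays vanish, in line with the statement that the metric and the Berry curvature cancel without TE–TM coupling.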


Quantum geometry of emergent Dirac cones 


The sample studied is a high-quality microcavity with a 100-ps life- 
time (see Methods section ‘Sample details’). The experimental setup 
is shown in Extended Data Fig. 1. The measurements are executed at 
4 K in a reflection configuration under an applied external magnetic 
field. We first use off-resonant continuous-wave laser excitation. The 
photoluminescence is measured versus the 2D wavevector and energy 
for all six polarization axes of the Poincaré sphere (Fig. 1a) correspond- 
ing to the horizontal-vertical (HV), diagonal-antidiagonal (DA), and 
circular right-left (RL) polarizations. For each wavevector, the ener- 
gies of the polarization doublet are found by Gaussian fitting of the 
emission. Their pseudospin (Stokes vector) is determined from the 
polarization intensities as: 


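
$$S_{1}(\mathbf{k}) = \frac{I_{H} - I_{V}}{I_{H} + I_{V}},\qquad S_{2}(\mathbf{k}) = \frac{I_{D} - I_{A}}{I_{D} + I_{A}},\qquad S_{3}(\mathbf{k}) = \frac{I_{R} - I_{L}}{I_{R} + I_{L}} \tag{5}$$

The k-space pseudospin distribution then enables us to compute the 
QGT components of each branch using equation (4). 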
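
As an illustration of this step, the following Python sketch turns six polarization-resolved intensities into the Stokes (pseudospin) components and the corresponding angles on the Poincaré sphere. The function name and the sample intensities are hypothetical, and `arctan2` is used in place of a plain arctangent of the ratio so that the azimuthal angle is resolved in the correct quadrant; that is an implementation choice, not something specified in the text.

```python
import numpy as np

def pseudospin_from_intensities(i_h, i_v, i_d, i_a, i_r, i_l):
    """Stokes components and Poincare-sphere angles from six polarized intensities."""
    i_h, i_v, i_d, i_a, i_r, i_l = (
        np.asarray(x, dtype=float) for x in (i_h, i_v, i_d, i_a, i_r, i_l)
    )
    s1 = (i_h - i_v) / (i_h + i_v)   # horizontal-vertical degree
    s2 = (i_d - i_a) / (i_d + i_a)   # diagonal-antidiagonal degree
    s3 = (i_r - i_l) / (i_r + i_l)   # circular right-left degree
    theta = np.arccos(np.clip(s3, -1.0, 1.0))   # polar angle
    phi = np.arctan2(s2, s1)                    # azimuthal angle, quadrant-aware
    return s1, s2, s3, theta, phi

# Hypothetical intensities for a single wavevector: fully horizontal light.
s1, s2, s3, theta, phi = pseudospin_from_intensities(1.0, 0.0, 0.5, 0.5, 0.5, 0.5)
```

The same function accepts arrays, so it can be applied to whole k-space maps at once.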

Figure 1 shows the 0 T measurements (no Zeeman splitting, $\Delta_{z} = 0$) 
at zero exciton-photon detuning (see Methods section ‘Exciton- 
photon detuning’). The energy dispersion extracted from the raw pho- 
toluminescence (see Methods section ‘Experimental setup to measure 
the QGT’ and Supplementary Note 2) is shown in Fig. 1b. The inset shows 
the energy difference between the modes. By fitting the dispersion, 
we obtain the polariton mass m = (9.2 + 0.1) x 10m, (mg is the free 
electron mass), the TE–TM splitting $2\beta = 26.3 \pm 0.3$ µeV µm², and the 
birefringence (HV splitting) $2\alpha = 15.3 \pm 0.6$ µeV at k = 0. (Errors denote 
the standard deviation of the fitting parameters.) If the HV splitting 
were zero ($\alpha = 0$), the dispersion would be composed of two parabolas 
of different masses touching at k = 0, similar to the quadratic band 
degeneracies in bilayer graphene. The Berry phase accumulated along 
a closed loop around the band touching point would be $2\pi$ (a Berry 
topological charge of 1). When the HV splitting is non-zero ($\alpha \neq 0$), as 
in our sample, the cylindrical symmetry is broken. Along $k_{x}$, the lowest 
energy mode has the smallest mass. The two parabolas cross at 
$k = \sqrt{\alpha/\beta} = 0.8$ µm⁻¹, where the TE–TM and HV splittings cancel each 
other. Along $k_{y}$ such points are absent, because both contributions 
have the same sign. This is visible in Fig. 1c, d, which shows the HV and 
DA polarization degree of the lower band (the circular polarization 
degree is zero at 0 T). 

degree maps of the lower energy mode for RL (b), HV (c) and DA (d). The crosses 
mark the anticrossing points. e, Pseudospin distribution in k-space, magnified 
near one of the crossing points. The in-plane pseudospin ($S_{1}$, $S_{2}$) is shown by the 
white arrows (lengths in arbitrary units), and the $S_{3}$ amplitude (dimensionless) 
is shown by colour. 
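
The crossing wavevector quoted in the text can be checked directly from the fitted splittings. The short Python sketch below does this and also evaluates the mode splitting along the two principal directions for zero Zeeman term; the variable and function names are hypothetical, and the splitting is taken as twice the magnitude of the in-plane effective field, which is an assumption consistent with a two-band Hamiltonian of this form.

```python
import math

TWO_ALPHA_UEV = 15.3        # HV splitting 2*alpha (µeV), from the fit quoted in the text
TWO_BETA_UEV_UM2 = 26.3     # TE-TM splitting 2*beta (µeV µm^2), from the fit quoted in the text

alpha = TWO_ALPHA_UEV / 2.0
beta = TWO_BETA_UEV_UM2 / 2.0

def splitting_along_kx(k):
    """Energy splitting 2|alpha - beta k^2| (µeV) along kx, with zero Zeeman term."""
    return 2.0 * abs(alpha - beta * k * k)

def splitting_along_ky(k):
    """Energy splitting 2|alpha + beta k^2| (µeV) along ky, with zero Zeeman term."""
    return 2.0 * abs(alpha + beta * k * k)

k_cross = math.sqrt(alpha / beta)   # about 0.76 µm^-1, quoted as 0.8 µm^-1
```

Along ky the two contributions add, so `splitting_along_ky` never reaches zero, which is why no crossing appears in that direction.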

The degeneracy points, marked by crosses, are tilted Dirac cones, 
around which the effective field and pseudospin textures appear simi- 
lar to 2D monopoles (Fig. 1e, f). The breaking of the TE-TM rotational 
symmetry by the HV field induces the separation of the TE-TM vector 
field of winding number 2 into a pair of 2D monopoles of winding 1, 
but of opposite divergences. Each monopole carries a Berry topologi- 
cal charge of 1/2, so that the band topology does not depend on the 
HV splitting, but the band geometry does. The Berry curvature of each 
monopole is a delta function, whereas the metric has a finite extension 
measured in Fig. 1g. Therefore, any finite-duration measurement of the 
Berry phase performed by making a loop around these points should 
show a deviation from the adiabatic value of 1 quantified by this metric 
distribution. These effective monopoles can be mapped to emergent 
non-Abelian gauge fields acting on photons. Interestingly, the metric 
distribution around the crossing points is not cylindrically symmetric, 
which might be due to non-hermiticity. 

Now we break the time-reversal symmetry (Fig. 2), applying a 9 T 
magnetic field described by the Zeeman term $\Delta_{z}$. The field also makes 
the exciton–photon detuning slightly negative (see Methods section 
‘Exciton–photon detuning’), owing to the exciton diamagnetic shift. 
Figure 2a shows the dispersions along $k_{x}$ and $k_{y}$, as in Fig. 1b (the inset 
shows the energy difference). The modes are now split everywhere in 
k space, and at k = 0 the energy splitting is approximately 102 µeV. The 
crossing along $k_{x}$ becomes an anticrossing. The splitting at the 
anticrossing point is the polariton Zeeman splitting $2\Delta_{z} = 100.9 \pm 0.6$ µeV (where 
the error is the standard deviation of the parameter), caused by the 
excitonic part (exciton g factor of about 0.2). It is about ten times 
larger than the linewidth of our ultrahigh-quality sample, despite the 
optical frequency operation. The measured polarization degrees are 
shown in Fig. 2b–d. The polarization at k = 0 becomes elliptical. The 
circular polarization degree decreases along $k_{y}$ and increases along $k_{x}$ 
up to the anticrossing points, where it becomes close to 1. A magnification of the 
measured pseudospin texture around one of these points is shown in Fig. 2e, 
exhibiting a part of a half-skyrmion, as expected. 
The k-space distributions of the Berry curvature and of the three 
components of the quantum metric tensor, extracted from the 
experimental data of Fig. 2 using equation (4), are shown in Fig. 3a–d. They 
are compared with analytical calculations (Fig. 3e–h; formula given in 
Supplementary Note 1) performed using the parameters extracted from 
the dispersions in Figs. 1b, 2a. Without HV splitting, the Berry curvature 
would be circularly symmetric, whereas for a dominating HV splitting 
the distribution would be concentrated around the anticrossing points. 
Here, we are between these two limiting cases. A similar procedure 
applied to the second polarization branch (see Supplementary Fig. 4) 
confirms that the two branches are cross-polarized and show opposite 
Berry curvatures and the same quantum metric elements. 

Fig. 3 | Berry curvature and quantum metric distributions. a–d, Experiment, 
k-space distribution of QGT elements (from photoluminescence at 9 T): Berry 
curvature $B_{z}$ (a); $g_{xx}$ (b); $g_{yy}$ (c); and $g_{xy}$ (d); extracted using equation (4). e–h, 


Anomalous Hall effect 
A consequence of non-trivial band geometry is the anomalous Hall 
drift of an accelerated wavepacket which appears as a correction in 
the semiclassical equation of motion”: 

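$$\hbar\,\frac{\partial \mathbf{r}}{\partial t} = \frac{\partial E}{\partial \mathbf{k}} + \mathbf{F}\times\mathbf{B} \qquad (6)$$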


where r is the centre of mass of the wavepacket, E(k) is the dispersion,
F is the accelerating force and B = B_z e_z, where e_z is the unit vector in
the z direction. The acceleration is provided by the thickness gradient
of the microcavity. The resulting energy gradient accelerates polaritons
similar to the way an electric field accelerates charges. We choose a
sample region with the largest gradients of 6 meV mm⁻¹ and negative
exciton–photon detuning. The gradient is measured independently
(see Supplementary Note 3) and exhibits a slight spatial variation (a
saddle-type potential). We selectively excite the lower polariton branch
at k = 0 with a continuous-wave laser (30-μm-diameter spot). Figure 4a
shows the spatial distribution of the intensity at +9 T (removing noise) 
under elliptically polarized excitation of the lower eigenstate (see Meth- 
ods section ‘Experimental setup to measure the polariton anomalous 
Hall effect’). The two traces separate along their propagation. This 
is confirmed in Fig. 4b, which shows the measured centre-of-mass 
trajectories, well reproduced by numerical simulations based on the 
semiclassical equation (6), using as input parameters the potential and 
the Berry curvature distribution computed using equations (1) and (4) 
and the experimentally fitted parameters. Interestingly, this saddle-like 
potential magnifies the drift by a factor of 1.6 with respect to a constant 
gradient (see Supplementary Note 3). The oscillations of experimental 
trajectories are attributed to sample disorder. They remain smaller 
than the global drift. The role of non-adiabaticity on the trajectories, 
Theory. The computations are based on the effective Hamiltonian (equation
(1)) and Supplementary Note 1.


quantified by the quantum metric’, cannot be evidenced here, owing 
to the experimental uncertainties. However, non-adiabaticity can be 
increased by modifying the excitation conditions. Figure 4c shows the 
cross-polarized emission when the excitation is circularly polarized, 
slightly exciting the upper polarization eigenstate (black, experiment; 
blue, full Schrödinger simulation presented in Supplementary Note 3).
The contrast of these oscillations is the non-adiabatic fraction, also 
given by the distance between the quantum states at k=O and k=k, 
determined from the metric g; (more details given in Supplementary 
Note 4) as: 
Fig. 4 | Polariton anomalous Hall effect. Continuous-wave-laser resonant
excitation of the lower polariton mode at k = 0 with 30-μm-diameter excitation
spot. a, Spatial distribution of emission at +9 T (pink) and −9 T (blue).
b, Centre-of-mass trajectories at +9 T (red) and −9 T (blue). The experimental data
(averaged over groups of points) are shown by squares with error bars
representing the average value of the standard deviation within each group and
the theory-based data from the semiclassical equation are shown by solid lines.
In a and b the polarization is elliptical, corresponding to the polarization of the
eigenstate. c, The polarization of the excitation is circular, which leads to
oscillations in the intensity of the emission in the cross-circular polarization.
Black, experiment; blue, theory. a.u., arbitrary units. X and Y are the in-plane
axes of the sample.
where dt is taken along the shortest distance path. The distance that
we estimate from the oscillations is 0.16 ± 0.02, whereas equation (7)
gives 0.18 ± 0.03, showing a remarkable agreement. (The errors here
correspond to the uncertainty estimates discussed in the Supplementary
Information.)
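
To make the role of the F × B term in equation (6) concrete, here is a minimal sketch of a semiclassical trajectory integration. It assumes dimensionless units (ħ = 1), a constant force along x, a parabolic dispersion and a hypothetical Lorentzian Berry curvature profile; none of these are the fitted experimental inputs, and the function names are illustrative only. For a force along x and B along z, (F × B)_y = −F_x B_z, so the transverse drift is Δy = −∫ B_z dk_x.

```python
import math

HBAR = 1.0  # dimensionless units (assumption)

def berry_curvature(kx, b0=1.0, k0=1.0):
    """Hypothetical Lorentzian B_z(kx); not the measured distribution."""
    return b0 / (1.0 + (kx / k0) ** 2)

def group_velocity_x(kx, mass=1.0):
    """(1/hbar) dE/dkx for an assumed parabolic dispersion E = hbar^2 k^2 / 2m."""
    return HBAR * kx / mass

def integrate_trajectory(force_x, t_end, b0=1.0, n_steps=20000):
    """Integrate hbar dk/dt = F and hbar dr/dt = dE/dk + F x B (midpoint rule)."""
    dt = t_end / n_steps
    kx = x = y = 0.0
    for _ in range(n_steps):
        k_mid = kx + 0.5 * force_x * dt / HBAR
        x += group_velocity_x(k_mid) * dt
        # Anomalous term: (F x B)_y = -F_x * B_z for F along x and B along z.
        y += -force_x * berry_curvature(k_mid, b0=b0) * dt / HBAR
        kx += force_x * dt / HBAR
    return kx, x, y

k_final, x_final, y_plus = integrate_trajectory(force_x=1.0, t_end=2.0, b0=+1.0)
_, _, y_minus = integrate_trajectory(force_x=1.0, t_end=2.0, b0=-1.0)
```

Reversing the sign of the Berry curvature (as happens when the magnetic field is reversed) flips the transverse drift, which is why the two traces separate along their propagation.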

Our experiments provide a measure of both the full non-trivial band 
geometry of a 2D continuous system and, independently, real-space 
wavepacket motion demonstrating anomalous Hall effect and non- 
adiabaticity. The experiments support the validity of the semiclassi- 
cal approach and band geometry to compute wavepacket dynamics. 
Our results demonstrate that 2D photonic modes, because they are 
TE and TM polarized, carry topological charges, which is essential 
for topological photonics. Indeed, by using an appropriate lattice, 
the geometrically non-trivial bulk photon dispersion transforms into 
gapped topologically non-trivial photonic Bloch bands, the QGT
of which can be explored by our technique. The polaritonic platform
(interacting photons) able to demonstrate lasing and quantum fluid
behaviour (superfluidity, quantized vortices, etc.) opens up opportunities
for topological physics. The platform has already enabled notable
advances, such as topological lasers, and offers exciting possibilities,
such as mixing different topological effects related to quantum vortices
and band structures.
Methods


Sample details 

The sample used for this experiment is a high quality-factor 3λ/2
GaAs/AlGaAs planar cavity. The high growth precision of the sample
via molecular beam epitaxy results in a sample with a quality factor
exceeding 100,000, and the associated lifetime, τ, for the lower
polariton branch surpasses 100 ps. This lifetime is measured by
propagation experiments at negative exciton–photon detuning, as
done previously. Moreover, the quantum wells show large areas
(up to several hundreds of micrometres) that are free from defects.
The cavity contains 12 GaAs quantum wells, 7-nm thick, organized
in groups of four and placed at three antinode positions of the
electric field. The front (back) mirror consists of 34 (40) pairs of
AlAs/Al0.80Ga0.20As layers. The exciton energy is E_exc = 1.611 eV and the
Rabi splitting is ħΩ_R = 16 meV at 0 T. The full polariton dispersion
measurement evidencing the exciton–photon anticrossing is shown
in Supplementary Fig. 6.

The exciton diamagnetic shift (blueshift) is 4 meV at 9 T. The oscillator
strength increases when applying 9 T, leading to a 17% increase in
the Rabi splitting. These two effects almost cancel each other, resulting
in a small blueshift for the bottom of the lower polariton branch
of 0.4 meV at 9 T. The exciton–photon detuning (see below) at 9 T is
therefore more negative by 4 meV than at 0 T. This detuning change
remains moderate compared with the Rabi splitting value. The exciton
fraction of the exciton polariton is reduced from 0.5 at 0 T to about
0.4 at 9 T, whereas the photon fraction increases from 0.5 to about
0.6. The exciton diamagnetic shift and the increase of the exciton oscillator
strength are well known effects that are the result of the decrease
of the exciton Bohr radius under a magnetic field.


Exciton-photon detuning 

One of the most important parameters controlling the properties
of exciton polaritons is the exciton–photon detuning, which is the
difference between the energies of the two bare resonances at k = 0:
δ = E_phot − E_exc. This parameter controls the excitonic and photonic
fractions of the lower polariton branch, that is, their deviation from
equal values of 50% at δ = 0. For example, the excitonic fraction (the
square of the Hopfield coefficient) for relatively small detunings is
X(δ) = 1/2 + δ/(2ħΩ_R). As the detuning becomes more negative,
the polaritons become more photonic, which means (among other
things) that the exciton-related Zeeman splitting Δ_z decreases and the
photon-related spin–orbit coupling β increases.
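
The numbers quoted in the sample details can be checked against this expression. The sketch below evaluates the small-detuning formula together with the textbook two-coupled-oscillator result X = [1 + δ/√(δ² + (ħΩ_R)²)]/2, which reduces to the linear form for |δ| ≪ ħΩ_R. The two-oscillator formula and the function names are assumptions introduced here for comparison; the input values (16 meV Rabi splitting, 17% increase and 4 meV more negative detuning at 9 T) are taken from the text.

```python
import math

def excitonic_fraction_linear(detuning_mev, rabi_mev):
    """Small-detuning approximation X = 1/2 + delta / (2 * hbar*Omega_R)."""
    return 0.5 + detuning_mev / (2.0 * rabi_mev)

def excitonic_fraction_two_level(detuning_mev, rabi_mev):
    """Lower-polariton excitonic fraction from a two-coupled-oscillator model."""
    return 0.5 * (1.0 + detuning_mev / math.hypot(detuning_mev, rabi_mev))

rabi_0t = 16.0               # meV, Rabi splitting at 0 T
rabi_9t = rabi_0t * 1.17     # meV, 17% larger at 9 T
detuning_9t = -4.0           # meV, starting from delta = 0 at 0 T

x_linear_9t = excitonic_fraction_linear(detuning_9t, rabi_9t)
x_two_level_9t = excitonic_fraction_two_level(detuning_9t, rabi_9t)
```

Both estimates give an excitonic fraction slightly below 0.4 at 9 T, in line with the reduction from 0.5 to about 0.4 stated above.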

To perform the two experiments (the QGT and the anomalous Hall
drift measurements), we selected two different regions of the same
wafer. For the QGT measurements, we selected a central region of the
sample at δ = 0 meV and 0 T. Then, to have the adiabatic acceleration
needed to observe the anomalous Hall drift effect, we selected a lateral
region of the same wafer showing a rapid change in the energy of the
lower polariton branch (approximately equal to 3 nm mm⁻¹), and a
negative detuning of δ = −10 meV.


Experimental setup to measure the QGT 

The microcavity is cooled to 4 K in a closed-loop helium cryostat 
equipped with a superconductive magnet able to generate a field 
onto the sample that spans from —9 T to 9 T in a Faraday configura- 
tion (the external magnetic field is perpendicular to the microcavity 
plane). 

For the measurement of the QGT, the excitation is performed by an
off-resonance linearly polarized continuous-wave 2-μm-diameter laser
spot, tuned to the first minimum of the stopband oscillations, so as to
maximize the injection. The sample excitation and the collection of the
polaritonic photoluminescence are performed in a reflection scheme,
by means of a wide numerical aperture objective (0.86), resulting in a
14-μm⁻¹ field of view. An 8-μm⁻¹ portion of k space is then reconstructed
on the monochromator slits so that the photoluminescence can be
energetically resolved. To avoid any loss of k-space information, the
entire detection line is built in a 2f configuration and the required
polarization filtering is performed in the real-space plane. The polarization
response of the setup is characterized before the experiments. The raw
photoluminescence data are collected by an automatic LabVIEW routine
able to perform a complete tomography in any of the three polarization
bases (HV, AD or RL), via the sequential passage of light through
a pair of motorized quarter- and half-waveplates and a polarizer. The
energy mapping onto the charge-coupled device (CCD) camera is performed
through a 550-cm monochromator equipped with a grating of
1,800 lines per mm and slit aperture set to 80 μm. For each polarization,
a scan of 561 images is acquired, each containing an I(E, k_y) map at a
given k_x, upon moving a translational stage mounting the final lens by
steps of 12 μm. In this way, a three-dimensional set of photoluminescence
data, I(E, k_x, k_y), is collected in any of the six polarization states. The
energy resolution of the image is δE = 0.015 nm. The momentum resolutions
are δk_y = 0.008 μm⁻¹ per pixel and δk_x = 0.014 μm⁻¹ per frame,
corresponding to the momentum magnification with respect to the
CCD pixel size and scan lens movement step, respectively.


Experimental setup to measure the polariton anomalous Hall 
effect 
The anomalous Hall drift experiment is also performed in reflection
configuration, in the same cryostat at the same temperature and magnetic
field conditions, but using a 3-cm focal distance doublet ensuring
a real-space field of view exceeding 500 μm. Resonant excitation of
the lower polariton mode at k = 0 is performed, with a polarization of
excitation that is adjusted to that of the eigenstate. The chosen
region of the sample exhibits the highest gradients.

The experimental uncertainties for all types of measurements are 
discussed in Supplementary Note 5. 


Numerical analysis 

We start by fitting the total intensity for each wavevector (k_x, k_y) with
a double Gaussian curve, which enables the discovery of the energies
of the two eigenstates, E_±. Then, the intensities of the six polarization
components are obtained at the energies of the eigenstates by integration
within the Gaussian width, and the components of the pseudospin
calculated from these intensities. If the modes are almost degenerate
in total intensity, with the energy difference falling below the inhomogeneous
broadening, they can often still be distinguished by separately
studying the spectra in the polarization components. This enables
resolution of the branches for small energy differences. The pseudospin
maps of the lower and upper eigenstates, encoded in the angles θ and φ,
are then smoothed with a low-pass filter that eliminates noise. Finally,
the components of the QGT are calculated from the pseudospin using 
equation (4). The gradient is obtained by the Green—Gauss method 
with simple face averaging. Parallel computing is used to accelerate the 
treatment of 4.6 x 10° experimental datapoints. The final resolution of 
the QGT components is 1,024 x 561. 
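
The step from the six polarization intensities to the pseudospin angles can be sketched as follows. This assumes the usual Stokes-parameter definitions, with the circular component taken along the z axis of the Bloch sphere; the axis assignment and the function names are assumptions, since the section does not spell out its convention.

```python
import math

def pseudospin_from_intensities(i_h, i_v, i_d, i_a, i_r, i_l):
    """Stokes components (S1, S2, S3) from the HV, AD and RL intensity pairs."""
    s1 = (i_h - i_v) / (i_h + i_v)
    s2 = (i_d - i_a) / (i_d + i_a)
    s3 = (i_r - i_l) / (i_r + i_l)
    return s1, s2, s3

def pseudospin_angles(s1, s2, s3):
    """Polar angle theta (from the circular axis) and azimuth phi of the pseudospin."""
    norm = math.sqrt(s1 * s1 + s2 * s2 + s3 * s3)
    if norm == 0.0:
        raise ValueError("unpolarized signal: angles are undefined")
    theta = math.acos(s3 / norm)
    phi = math.atan2(s2, s1)
    return theta, phi

# Horizontally polarized light: equal intensities in the AD and RL bases.
theta_h, phi_h = pseudospin_angles(*pseudospin_from_intensities(1.0, 0.0, 0.5, 0.5, 0.5, 0.5))
# Right-circular light: the pseudospin points along the pole.
theta_r, _ = pseudospin_angles(*pseudospin_from_intensities(0.5, 0.5, 0.5, 0.5, 1.0, 0.0))
```

Normalizing by the vector length before taking the angles means a partially depolarized signal still yields a direction on the sphere, which is one possible design choice among several.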


Data availability 


The datasets generated and/or analysed during the current study are 
available in the Open Science Framework (OSF) repository at
https://osf.io/s4rzu/?view_only=1cabd49416c04a9baed856dee3aelba9.
Extended Data Fig. 1 | Experimental setup. Schematic of the polarization
tomography experiment. The incoming pump laser (bottom right) is focused
onto the sample held in the cryogenic superconductive magnet (bottom left).
The emission is recollected, polarization filtered and the momentum space
optically rebuilt at the entrance slits of a spectrometer (top) with energy
resolution of 30 μeV (top left). The Zeeman splitting is highlighted in the inset.
Attosecond pulses are central to the investigation of valence- and core-electron
dynamics on their natural timescales. The reproducible generation and
characterization of attosecond waveforms has been demonstrated so far only
through the process of high-order harmonic generation. Several methods for
shaping attosecond waveforms have been proposed, including the use of metallic
filters, multilayer mirrors and manipulation of the driving field. However, none of
these approaches allows the flexible manipulation of the temporal characteristics of
the attosecond waveforms, and they suffer from the low conversion efficiency of the
high-order harmonic generation process. Free-electron lasers, by contrast, deliver
femtosecond, extreme-ultraviolet and X-ray pulses with energies ranging from tens of
microjoules to a few millijoules. Recent experiments have shown that they can
generate subfemtosecond spikes, but with temporal characteristics that change
shot-to-shot. Here we report reproducible generation of high-energy (microjoule level)
attosecond waveforms using a seeded free-electron laser. We demonstrate
amplitude and phase manipulation of the harmonic components of an attosecond
pulse train in combination with an approach for its temporal reconstruction. The
results presented here open the way to performing attosecond time-resolved
experiments with free-electron lasers.


The intensities and relative phases between the harmonics qω_0 (with
q an integer and ω_0 the fundamental frequency) in an extreme ultraviolet
(XUV) frequency comb determine the temporal structure of
the resulting attosecond pulse train. The intensities of the harmonics
can be easily measured using a (photon or electron) spectrometer.
Phase information, which is harder to come by, is usually obtained
by observing the interference between different pathways leading to
states with the same final energy, where the phase to be characterized
is included in at least one of the pathways. With XUV pulses, the natural
observable is a photoelectron, hence different pathways into the
ionization continuum are studied. The XUV frequency comb produced
by high-order harmonic generation (HHG) consists of odd-integer
harmonics of the fundamental field, and the ionization process takes
place in the presence of a near-infrared (NIR) dressing field with the
same frequency ω_0. Under these conditions, additional photons may be
absorbed or emitted, producing a single sideband halfway between the
main photoelectron peaks. Each sideband can be populated through
two pathways leading to final states of the same parity and this results
in a variation in sideband amplitude as a function of the relative phase
of the two pathways. If the XUV and fundamental fields are precisely
synchronized, as they can be in HHG, then delaying the fields with
respect to each other reveals the phase information.

In our study, the harmonic comb was generated by the seeded free-electron
laser (FEL) FERMI, which uses an ultraviolet pulse
(ω_UV = ω_0 = 4.69 eV) derived from a frequency-tripled NIR pulse
(ω_NIR = ω_UV/3) as the seed. Three (q = 7, 8, 9) and four (q = 7, 8, 9, 10)
Fig. 1 | Multi-photon sideband generation and principle of the
measurement. a, Schematic view of multi-NIR-photon sideband generation.
Shown are the energy levels of the photoelectrons generated by the harmonics
of the FEL (H7–H10; magenta arrows) and by the additional absorption and
emission of one and two NIR photons (sidebands S_{q,q+1}; red arrows). b, Expected
normalized intensity of the photoelectron spectra as a function of the relative
delay τ between the train of attosecond pulses and the NIR field along the
(positive) common direction of polarization of the two fields. The


harmonics of ω_UV were generated using two different undulator configurations
(see Extended Data Fig. 1 and Extended Data Table 1). To
characterize the pulses, photoionization took place in the presence of
a field with frequency ω_NIR, leading to the formation of two sidebands
between each pair of the main XUV peaks (see Fig. 1a). The two sidebands
S^(±)_{q,q+1} can each be populated through two paths characterized
by a different number of exchanged NIR photons: the absorption of
one photon of the harmonic q and one (two) NIR photon(s), or the absorption
of one photon of the harmonic q + 1 and the emission of two (one)
NIR photon(s). The difference in parity of the final states of the interfering
pathways determines an asymmetry in the photoionization emission.
If the observation is restricted along a single direction, the
intensity of the sidebands oscillates as a function of the delay τ between
the NIR and XUV pulse (see Fig. 1b):

$$S^{(\pm)}_{q,q+1}(\tau) \propto 1 \pm a_{q,q+1}\cos\left[\varphi_q - \varphi_{q+1} + 3\omega_{\mathrm{NIR}}\tau\right] = 1 \pm P_{q,q+1}(\tau) \qquad (1)$$

where a_{q,q+1} depends on the intensity and energy of the two harmonics
q and q + 1 with phases φ_q and φ_{q+1}, on the photoelectron energy, and


photoelectron spectra are characterized by an oscillation with a period
T = 2π/(3ω_NIR). c–f, Correlation plots of the oscillating components of the
sidebands (P_{7,8}, P_{8,9}) for four phase differences Δφ_{7,8,9}: 0 (c), π/2 (d), π (e) and
3π/2 (f). At the top of each plot is shown the value of Δφ_{7,8,9} (left) and ρ_{7,8,9} (right).
g, Evolution of the correlation coefficient ρ_{7,8,9} as a function of Δφ_{7,8,9}, showing
the locations corresponding to the plots c–f. The intensity of the NIR pulse is
Inig=1.5 10" W cm. 


on the intensity of the NIR pulse, and the equality defines the oscillating
component of the sideband intensity P_{q,q+1}, under the approximations
detailed in the Supplementary Information. If the delay τ could
be precisely controlled, then the relative phase between consecutive
harmonics could be estimated from the time shift between the oscillations
of the sidebands.

This approach cannot be applied directly to the reconstruction of
the relative phase of multiple harmonics generated by an FEL owing
to the lack of subcycle synchronization between the harmonics and
the NIR field, which completely washes out the delay dependence
of the sideband oscillations. The information can still be retrieved,
however, through a correlation analysis of the fluctuating sideband
intensities measured on a single-shot basis. This approach is presented
in Fig. 1c–f, which shows the simulated correlation plots of the oscillating
components P_{7,8} and P_{8,9} of the sidebands for a random variation
of the delay τ in the range ±3 fs, which is the typical delay jitter measured
in the experiment. The correlation plot eliminates the explicit
dependence on τ and results in an ellipse, the shape of which depends
on the phase difference:
Fig. 2 | Correlation plots of the oscillating components of the sidebands and
retrieval of phase difference Δφ_{7,8,9}. a–j, Evolution of the correlation plots of
the oscillating components of the sidebands (P_{7,8}, P_{8,9}) for increasing values of
the delay τ_{9,2} introduced by the phase shifter PS_2 (Extended Data Fig. 1a). At
the top right of each panel is given the value of Δφ_{7,8,9}. The colour scale indicates
the density of single-shot experimental points normalized to unity for each
panel. k, Evolution of the correlation coefficient ρ_{7,8,9} as a function of the delay
τ_{9,2} (black points and dashed line) and the sinusoidal fit (red). The Δφ_{7,8,9} upper x


$$\Delta\varphi_{q-1,q,q+1} = \varphi_{q+1} + \varphi_{q-1} - 2\varphi_q \qquad (2)$$


The intensity profile of the pulse train depends only on this phase
difference (apart from a trivial time shift; see Supplementary Information).
Depending on the phase difference Δφ_{7,8,9}, the plot evolves from a
linear distribution with positive correlation (Fig. 1c), to a circle (Fig. 1d),
to a linear distribution with negative correlation (Fig. 1e), and finally
back to a circle (Fig. 1f). These changes clearly indicate that the shape
of the correlated distribution is related to the synchronization of the
three harmonics (the complete evolution as a function of the phase
difference is presented in Extended Data Fig. 2). The phase information
can be derived from the distribution by evaluating its correlation
coefficient ρ_{q−1,q,q+1} (see Supplementary Information), which quantifies
the extent to which the two oscillating components oscillate perfectly
in phase (Δφ_{q−1,q,q+1} = 0 and ρ_{q−1,q,q+1} = +1) or out of phase (Δφ_{q−1,q,q+1} = π
and ρ_{q−1,q,q+1} = −1). The correlation coefficient oscillates as a function of
the phase difference Δφ_{q−1,q,q+1}, as shown in Fig. 1g, and it closely resembles a
cosine function. Two different values of Δφ_{q−1,q,q+1} correspond to the same
value of the correlation coefficient: φ_0 and 2π − φ_0. This ambiguity can
be resolved in the experiment by controlling the modulus and sign of
the phase differences between the harmonics (see Supplementary
Information). Simulations based on the solution of the time-dependent
Schrödinger equation confirmed the validity of our approach (see
Extended Data Fig. 3).
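
The correlation analysis described above can be sketched numerically. The example below is an idealized toy model, not the authors' simulation: it takes P_{7,8} and P_{8,9} as pure cosines of the form cos[φ_q − φ_{q+1} + 3ω_NIR τ] with unit amplitude, draws the delay τ uniformly within ±3 fs, and computes the Pearson correlation coefficient, which in this idealization approaches cos(Δφ_{7,8,9}). The function names, the number of shots and the fixed random seed are assumptions.

```python
import math
import random

HBAR_EV_FS = 0.6582119569               # reduced Planck constant in eV fs
OMEGA_NIR = (4.69 / 3.0) / HBAR_EV_FS   # rad/fs, from omega_UV = 4.69 eV

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def simulated_correlation(delta_phi, n_shots=20000, jitter_fs=3.0, seed=0):
    """Correlation of P_{7,8} and P_{8,9} under random delay jitter (toy model)."""
    rng = random.Random(seed)
    phi7, phi8 = 0.0, 0.0
    phi9 = delta_phi  # chosen so that phi9 + phi7 - 2*phi8 = delta_phi
    p78, p89 = [], []
    for _ in range(n_shots):
        tau = rng.uniform(-jitter_fs, jitter_fs)
        p78.append(math.cos(phi7 - phi8 + 3.0 * OMEGA_NIR * tau))
        p89.append(math.cos(phi8 - phi9 + 3.0 * OMEGA_NIR * tau))
    return pearson(p78, p89)

def phase_from_correlation(rho):
    """Invert rho = cos(delta_phi); returns the branch in [0, pi] only."""
    return math.acos(max(-1.0, min(1.0, rho)))

rho_in_phase = simulated_correlation(0.0)
rho_quadrature = simulated_correlation(math.pi / 2)
rho_out_of_phase = simulated_correlation(math.pi)
```

Because the arccosine returns only values between 0 and π, the inversion reproduces the two-fold ambiguity between φ_0 and 2π − φ_0 noted in the text.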
axis (red) was obtained by assigning the maxima of the fit to the values
Δφ_{7,8,9} = 2mπ, where m is an integer. Letters a–j show the locations
corresponding to the data shown in the panels above. The error bars are the
standard deviation of the average correlation coefficient ρ_{7,8,9} evaluated over
ten sets of experimental data (each set consists of 1,200 single-shot points).
The intensity of the NIR pulse was estimated to be Jyjp=1.5 x10" Wcm™. The 
value of the correlation parameters, the phase differences and the 
corresponding errors are presented in Extended Data Table 2. 


In the experiment, the intensity of the harmonics was independently
controlled by tuning the undulator gaps and the dispersive section of
the electron transport optics. The phase between the harmonics was
controlled by phase shifters, which introduce a delay τ_{q,i} (i indicates
the ith phase shifter) for a selected harmonic q, affecting the phase
difference Δφ_{q−1,q,q+1} through a term qω_UV τ_{q,i} (see Extended Data Fig. 1).
In this respect, seeded FELs offer a superior degree of control with
respect to HHG sources, for which the intensities and phases of the
single harmonics cannot be independently controlled.

Figure 2a–j presents the experimental results for the three-harmonic
case, for different delays τ_{9,2} introduced on the ninth harmonic. These
measurements indicate a periodic evolution of the correlated distributions
in close agreement with the theoretical prediction. A partial
broadening of the distributions is attributed to the shot-to-shot fluctuations
of the single harmonic intensity. The correlation coefficients ρ_{7,8,9}
(black points and dotted line in Fig. 2k) and the fit (red curve) clearly
follow a cosine evolution in good agreement with the simulations. The
maxima of the fit were assigned to the phase differences Δφ_{7,8,9} = 0, 2π
(see upper x axis in Fig. 2k) and the curve was used to assign a phase
difference Δφ_{7,8,9} to each delay τ_{9,2}. The error in the estimation of the
phase difference depends on the slope of the curve (which depends
on the NIR intensity) and was typically in the range 0.05–0.1 rad for
our experimental conditions (see Supplementary Information). The
characterization of pulses with reproducible temporal structure gives
Fig. 3 | Complete phase and amplitude shaping of attosecond waveforms. a–g,
Photoelectron spectra (a), correlation plots of the oscillating components of
the sidebands (P_{7,8}, P_{8,9}; b, c, d) and reconstructed attosecond waveforms
(e, f, g) in the case of independent phase shaping for three phase differences
Δφ_{7,8,9} (black curve and panels b, e; red curve and panels c, f; blue curve and
panels d, g). The colour scale indicates the density of single-shot experimental
points normalized to unity for each panel. The three photoelectron spectra
in a were vertically shifted for visual clarity. h–n, Photoelectron spectra (h),
correlation plots of the oscillating components of the sidebands (P_{7,8}, P_{8,9}; i, j, k)
and reconstructed attosecond waveforms (l, m, n) in the case of independent


the possibility of accumulating data over several single-shot measurements,
thus improving the signal-to-noise ratio and reducing the error
in the temporal reconstruction. In the case of a self-amplified spontaneous
emission FEL, pulse properties change on a shot-to-shot basis and
a single-shot technique is therefore mandatory.

We exploited this approach to the determination of the relative phase
of XUV harmonics to demonstrate the independent phase-amplitude
shaping capability of attosecond waveforms offered by the FEL FERMI.
Figure 3a and Fig. 3b, c and d show three photoelectron spectra and
the corresponding correlation plots for three phase differences Δφ_{7,8,9},
respectively. The phase change does not appreciably modify the intensities
of the three harmonics (for the amplitudes F_j of the single jth
harmonic and for the phase differences see Extended Data Table 3).
The reconstructed intensity profiles I(t) are presented in Fig. 3e, f and
g. The measurements indicate a pure phase shaping of the harmonic
comb: a well-defined attosecond pulse train (Fig. 3e) obtained for
Δφ_{7,8,9} = 0.08 ± 0.08 rad (close to the ideal condition of harmonics in
phase, Δφ_{7,8,9} = 0) is transformed first into an attosecond pulse train of
lower amplitude with a satellite (Fig. 3f) when Δφ_{7,8,9} = 1.93 ± 0.03 rad,
and finally into an attosecond pulse train characterized by a double
structure for Δφ_{7,8,9} = 3.29 ± 0.24 rad (Fig. 3g), which is close to the condition
of harmonics out of phase, Δφ_{7,8,9} = π. Figure 3h shows three
photoelectron spectra corresponding to three different settings of the


amplitude control for three settings of the harmonic amplitudes (black curve
and panels i, l; red curve and panels j, m; blue curve and panels k, n) using the
same values of the phase shifters. The colour scale indicates the density of
single-shot experimental points normalized to unity for each panel. See
Extended Data Table 3 for additional information on the phase difference
Δφ_{7,8,9} and the amplitude of the harmonics F_7, F_8 and F_9. The errors in the
reconstruction of the attosecond pulse trains are determined by the error bars
for the amplitudes and phase differences (see Extended Data Table 3) and are
indicated as shaded areas in e, f, g, l, m, n.


amplitudes of the three harmonics (see Extended Data Table 3). The
amplitude of the single harmonic was modified by about 25% using the
dispersive section and the undulator gaps (see Supplementary Information).
Figure 3i, j and k show the correlation plots for the same position
of the phase shifters. The phase difference Δφ_{7,8,9} remains constant
within the experimental error, independent of the variations of the
single harmonic intensity. The reconstructed attosecond pulse trains
for the three configurations are presented in Fig. 3l, m and n. These
data demonstrate a pure amplitude shaping of the harmonic comb:
the well-defined attosecond pulse structure (Δφ_{7,8,9} is close to zero
(2π) for the three measurements) is preserved for the three configurations
and the different harmonic intensities lead only to a variation
in the maxima of the intensity profiles. The energy of the attosecond
pulse train was about 16 μJ. Table-top-based HHG sources yield much
lower energies (in the nanojoule range), and only a few experimental
groups have reported total pulse energies on target approaching the
microjoule range.
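
To illustrate how the harmonic amplitudes and phases set the shape of the pulse train, the following minimal sketch evaluates I(t) = |Σ_q F_q exp[i(qω_UV t + φ_q)]|² over one period of the comb. It is an idealized model with equal unit amplitudes and a chosen sign convention for the time dependence, both of which are assumptions; it is not the reconstruction procedure used for Fig. 3, and the function names are hypothetical.

```python
import cmath
import math

PLANCK_EV_FS = 4.135667696  # Planck constant h in eV fs

def comb_period_fs(photon_energy_ev):
    """Temporal period of a comb whose harmonic spacing is photon_energy_ev."""
    return PLANCK_EV_FS / photon_energy_ev

def pulse_train_intensity(orders, amplitudes, phases, x):
    """|sum_q F_q exp(i(q*x + phi_q))|^2 with x = omega_UV * t."""
    field = sum(f * cmath.exp(1j * (q * x + p))
                for q, f, p in zip(orders, amplitudes, phases))
    return abs(field) ** 2

def intensity_over_one_period(orders, amplitudes, phases, n_points=2000):
    """Sample the intensity on a uniform grid covering x in [0, 2*pi)."""
    return [pulse_train_intensity(orders, amplitudes, phases,
                                  2.0 * math.pi * i / n_points)
            for i in range(n_points)]

orders = (7, 8, 9)
unit = (1.0, 1.0, 1.0)
# Harmonics in phase: delta_phi = phi9 + phi7 - 2*phi8 = 0.
in_phase = intensity_over_one_period(orders, unit, (0.0, 0.0, 0.0))
# Harmonics out of phase: delta_phi = pi.
out_of_phase = intensity_over_one_period(orders, unit, (0.0, 0.0, math.pi))
period_fs = comb_period_fs(4.69)
```

In this toy model the in-phase comb gives one dominant spike per period, whereas Δφ_{7,8,9} = π gives two weaker spikes per period, qualitatively matching the double structure described for Fig. 3g. The comb period for ω_UV = 4.69 eV comes out just under 0.9 fs.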

We estimated that a pulse energy of about 50 nJ per harmonic is sufficient
for the acquisition of single-shot photoelectron spectra, which is
well below the typical energy per harmonic (a few microjoules) available
at FERMI. The currently available range of seed wavelengths at FERMI
(360–230 nm) would allow a moderate control of the comb periodicity
around 1 fs, but a revised layout of the seed laser optimized for this task
Fig. 4 | Synthesis of complex attosecond waveforms. a–i, Correlation plots of
the oscillating components of the sidebands P_{8,9}, P_{9,10} (a, d, g) and P_{7,8}, P_{8,9} (b, e, h)
and retrieved attosecond waveforms (c, f, i) for the four-harmonic experiment
and three different combinations of the phase differences Δφ_{7,8,9} and Δφ_{8,9,10}
(a–c; d–f; g–i). The colour scale indicates the density of single-shot experimental
points normalized to unity for each panel. See Extended Data Table 3 for


could increase the spike separation to tens of femtoseconds. Shorter
separations can already be achieved by using only odd or only even
harmonics (for example, q = 6, 8, 10).

We should point out that alternative FEL-based approaches have
been theoretically proposed for the generation of a train of attosecond
pulses. Even though the predicted peak power levels (gigawatts)
and pulse durations (down to sub-100 as) are comparable with those
reported here, or in principle achievable with our approach, these methods
do not offer a strategy for controlling the relative amplitudes and
phases of the single harmonics, that is, for attosecond pulse shaping.
Extension of our approach to wavelengths as short as 4 nm (300 eV)
appears feasible if combined with the echo-enabled harmonic generation
seeding scheme. Numerical simulations indicate that the method
for the temporal characterization could be applied for photon energies
from about 20 eV up to 300 eV, using a suitable gas target.

As a first demonstration of complex attosecond waveform shaping,
we considered the case of four harmonics (see Fig. 4), for which the
attosecond temporal structure depends on the two phase differences
Δφ_{7,8,9} and Δφ_{8,9,10}. The photoelectron spectra with (red curve) and
without (black curve) NIR are shown in Extended Data Fig. 1d. The independent
control of the two phases gives the opportunity to generate
ultrashort (chirp-free) attosecond pulse trains, as shown in Fig. 4a–c,
which report the correlation plots (P_{8,9}–P_{9,10} in Fig. 4a and P_{7,8}–P_{8,9} in
Fig. 4b) in the case of maximum positive correlation (see Extended
Data Table 3 for the values of the amplitude and phase differences). The
additional information on the phase differences Δφ_{7,8,9} and Δφ_{8,9,10}, and the
amplitude of the harmonics F_7, F_8, F_9 and F_10. The errors in the reconstruction of
the attosecond pulse trains are determined by the error bars for the amplitudes
and phase differences (see Extended Data Table 3) and are indicated as shaded
areas in c, f, i.


reconstruction (Fig. 4c) returns a duration (FWHM) of the single pulse 
of about 210 ± 4 as. Figure 4d and e present the results corresponding 
respectively to $\Delta\varphi_{8,9,10}$ = 0.20 ± 0.15 rad (close to the configuration of 
harmonics in phase, $\Delta\varphi_{8,9,10}$ = 0) and $\Delta\varphi_{7,8,9}$ = 1.23 ± 0.06 rad (harmonics 
only partially in phase). In the temporal domain, this condition 
translates into a partial broadening of the peaks (FWHM = 220 ± 5 as) 
and the appearance of small satellites in the reconstructed attosecond 
pulse train (Fig. 4f). Finally, Fig. 4g and h present the results when 
$\Delta\varphi_{8,9,10}$ = 2.89 ± 0.08 rad and $\Delta\varphi_{7,8,9}$ = 2.95 ± 0.09 rad (that is, both phase 
differences are close to π), respectively: the four harmonics are divided 
into two groups (harmonics 7 and 8, and harmonics 9 and 10), each pair 
of which is (approximately) in phase, with an additional phase jump of 
π between the two groups. This condition leads to a double attosecond 
pulse structure, which is visible in the reconstruction presented in 
Fig. 4i. The availability of six undulators at FERMI would in principle 
allow for the generation of six harmonics. This configuration may, 
however, require a revised and optimized setup. Simulations indicate 
that the experimental technique demonstrated in this work could be 
extended to the characterization of pulses with durations in the sub- 
100-as regime. 
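
The link between harmonic phases and the temporal shape of the pulse train can be illustrated with a short numerical sketch. It is not the reconstruction procedure of this work: it simply sums four harmonics (orders 7 to 10) of a 260 nm seed with equal amplitudes, and it assigns one phase per harmonic directly, whereas the text works with the phase differences $\Delta\varphi_{7,8,9}$ and $\Delta\varphi_{8,9,10}$. The seed wavelength, the equal amplitudes and all function names are assumptions made for illustration.

```python
import cmath
import math

C = 299_792_458.0           # speed of light (m/s)
SEED_WAVELENGTH = 260e-9    # assumed seed wavelength (m)
OMEGA = 2 * math.pi * C / SEED_WAVELENGTH   # seed angular frequency (rad/s)
PERIOD = SEED_WAVELENGTH / C                # seed optical period (s)


def intensity(t, amplitudes, phases, orders=(7, 8, 9, 10)):
    """Intensity envelope |sum_q F_q exp(i(q*omega*t + phi_q))|^2 at time t (s)."""
    field = sum(a * cmath.exp(1j * (q * OMEGA * t + p))
                for q, a, p in zip(orders, amplitudes, phases))
    return abs(field) ** 2


def sample_one_period(amplitudes, phases, n=4001):
    """Sample the intensity on [-T/2, T/2], where T is the seed period."""
    times = [(-0.5 + k / (n - 1)) * PERIOD for k in range(n)]
    return times, [intensity(t, amplitudes, phases) for t in times]


def fwhm_of_main_peak(times, values):
    """FWHM (s) of the highest peak, found by walking outwards to half maximum."""
    k_max = max(range(len(values)), key=values.__getitem__)
    half = values[k_max] / 2
    left = k_max
    while left > 0 and values[left] > half:
        left -= 1
    right = k_max
    while right < len(values) - 1 and values[right] > half:
        right += 1
    return times[right] - times[left]


if __name__ == "__main__":
    equal = (1.0, 1.0, 1.0, 1.0)          # hypothetical equal amplitudes
    t, i_flat = sample_one_period(equal, (0.0, 0.0, 0.0, 0.0))
    print("in-phase FWHM (as):", fwhm_of_main_peak(t, i_flat) * 1e18)
    # Phase jump of pi between the groups (7, 8) and (9, 10).
    jump = (0.0, 0.0, math.pi, math.pi)
    print("intensity at t = 0 with pi jump:", intensity(0.0, equal, jump))
```

With all four harmonics in phase the sketch gives a single pulse per seed period with a width just under 200 as, the same order as the measured value of about 210 as. With a π jump between the two groups the intensity vanishes at the centre and two lobes appear, which is the double-pulse structure described above.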

Our technique also offers the possibility of determining with subcycle 
resolution the relative phase of the XUV and NIR pulses, enabling phase- 
resolved pump-probe experiments at FELs based on attosecond pulse 
trains (see Supplementary Information and Extended Data Fig. 4). The 
high intensities in the XUV and X-ray spectral region reached by FELs, 


combined with the capabilities offered by seeding to independently 
control and shape the amplitudes and phases of attosecond pulses, 
will open new possibilities for the investigation and control of ultra- 
fast nonlinear electronic processes. The design of future seeded FEL 
sources at other facilities such as LCLS, FLASH and SINAP could be 
modified to optimize them for the mode of operation described here, 
which is at present possible at FERMI and DALIAN. In solid samples, 
attosecond shaped waveforms could be used to promote electrons 
from the inner-valence to the conduction band, giving the possibil- 
ity of investigating diffusion and relaxation effects with attosecond 
resolution and with temporally sculpted electronic wave packets. More 
generally, our results give unprecedented access to programmable 
attosecond waveforms at high intensities. 
Methods 


Experimental setup 

The experiment was performed at the seeded FEL FERMI and is sche- 
matically presented in Extended Data Fig. 1. Two different configura- 
tions of the undulators were implemented for the generation of three 
(Extended Data Fig. 1a) and four (Extended Data Fig. 1b) harmonics. 
The parameters of the FEL harmonics are reported in Extended Data 
Table 1. The seeding parameters (seed laser power and strength of the 
dispersive section) were carefully optimized in order to produce a suf- 
ficiently high bunching that could be preserved along the whole set of 
undulators tuned at the various harmonics. Tuning the first undula- 
tors to higher harmonics (shorter wavelengths) and the later ones to 
the lower harmonics had a twofold motivation. First, the bunching at 
higher harmonics was more prone to degradation and it would have 
been more difficult to preserve it up to the end of the undulator chain. 
Second, the diffraction at longer wavelengths was larger, so a shorter 
propagation path was preferable. For a properly optimized setup, each 
undulator group produced a coherent, ~50-fs-long FEL pulse centred 
at the resonant wavelength. The phase between the electric field of 
each harmonic was determined and controlled by the phase shifter 
available at FERMI at each undulator break. 

The XUV and NIR pulses (energy $E_{\mathrm{NIR}}$ = 45 µJ, duration $\mathrm{FWHM}_{\mathrm{NIR}}$ = 60 fs, 
intensity $I_{\mathrm{NIR}}$ = 1.5 x 10” W cm$^{-2}$) were temporally and spatially overlapped 
in the interaction region with a residual shot-to-shot delay jitter 
of $\Delta t$ = ±3 fs using a recombination mirror for collinear propagation. The 
single-shot photoelectron spectra, with (red lines) and without (black 
lines) the NIR pulse for the three and four harmonics configurations 
(see Extended Data Fig. 1c, d, respectively), were measured in neon 
using a magnetic bottle electron spectrometer placed in the interaction 
region (see Extended Data Fig. 1e). Only photoelectrons emitted 
in the upward hemisphere were collected, as shown by the detection 
efficiency (see Extended Data Fig. 1f) as a function of the angle between 
the emission direction and the spectrometer axis (which coincides 
with the (vertical) direction of polarization of the FEL pulses (see Sup- 
plementary Information)). For each machine setting, single-shot har- 
monic spectra (without NIR field) were measured”. From these data 
we estimated a typical shot-to-shot fluctuation (standard deviation) 
for the intensity of each harmonic of about 5%-8%. The energy of the 
single harmonic was proportional to the integral of the corresponding 
peak of the photoelectron spectrum. The integral was corrected for the 
response function of the magnetic bottle spectrometer for different 
photoelectron energies, the cross-section of the target gas, and the 
transmission of the XUV beamline”. 

The data were accumulated typically for about 10,000-12,000 shots 
for each machine and phase setting. For each setting, the mean intensities 
of the main photoelectron peaks $I_q$ ($q$ = 7, 8, 9 or $q$ = 7, 8, 9, 10) 
were determined. 

The total energy of the FEL pulse was measured (on a single-shot 
basis) with an ionization monitor placed upstream of the transmission 
and focusing XUV beamline. 


Reconstruction of attosecond pulses 

The temporal reconstruction of the attosecond pulse train using 
the correlation parameter $\rho_{q-1,q,q+1}$ for the simulations presented in 
Fig. 1c-f is shown in Extended Data Fig. 5a–d, respectively. The agreement 
between the input data (black curves) and the reconstructed 
profiles (red (Extended Data Fig. 5a), blue (Extended Data Fig. 5b), 
green (Extended Data Fig. 5c), and magenta (Extended Data Fig. 5d)) 
indicates the validity of our reconstruction method based on the value 
of the correlation parameter $\rho_{7,8,9}$. We also performed time-dependent 
Schrödinger equation (TDSE) simulations, which confirmed the validity 
of our reconstruction protocol. The correlation curve obtained using 
the TDSE simulations reproduces that obtained by the strong field 
approximation (SFA) with a small shift of 0.157 rad, which results in only 


minor corrections in the reconstruction of the intensity profile of the 
attosecond pulses. This shift was taken into account in the reconstruc- 
tions presented in the manuscript. 

The validity of the temporal reconstruction of the attosecond pulse 
train from the shift of the oscillations of the sidebands Siete (S.2.) 
ands, (Si.4) is shown in Extended Data Fig. 5e, which reports the 
input (black lines) and reconstructed (blue dotted lines) intensity pro- 
files for the simulation shown in Fig. 1b. 


Simulations for attosecond pulse generation 

The emission process for the configuration emitting harmonics 7, 8 
and 9 can be simulated with the new version of the FEL code GENESIS 
1.3. With the undulator set in the condition reported in Extended 
Data Fig. 1a, the resonant wavelength to be followed by the FEL code 
is different in the three sets of undulators. Given that GENESIS 
1.3 only tracks a relatively narrow-band field, this requires that different 
simulations are performed for the different sets of undulators. This 
option is normally used for harmonic generation FEL schemes and has 
been largely used in the study of the high-gain harmonic-generation 
operation mode implemented at FERMI. The problem is here compli- 
cated by the fact that consecutive undulators are tuned to wavelengths 
that are not integer multiples of one another. The recent upgrade of 
GENESIS 1.3 allows tracking of each single electron in the beam. The 
simulation can be performed if one carefully manages the transition 
from one undulator set to the next. After the interaction with the exter- 
nal seed laser is simulated, and the energy modulation at the 260 nm 
wavelength is imprinted in the beam, the particle phase-space is used to 
simulate the emission process at the ninth harmonic in the first group 
of undulators. The electric field produced here is then propagated 
in free space to the exit of the whole radiator, while the electrons are 
used for simulating the emission process at the eighth harmonic in the 
second group of undulators. The same is done with fields and electrons 
entering the third group of undulators. Finally, three fields are pro- 
duced representing the emission from each undulator set at the exit 
of the radiator. The result of the simulation is presented in Extended 
Data Fig. 6a, b, which shows the femtosecond envelope (Extended Data 
Fig. 6a) and the attosecond structure (Extended Data Fig. 6b) of the 
XUV pulse obtained using the combination of the three harmonics. The 
resonance condition, phase shifts between sets of undulators, and other 
parameters can be adjusted as usual. Simulations rely on the standard 
electron beam and seed laser for FERMI. Seed laser power and strength 
of the dispersive section are used as optimization parameters to 
maximize the bunching and keep emission balanced between various 
harmonics. The parameters used in the simulation are summarized in 
Extended Data Table 4. 
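
As a quick consistency check on the simulation settings, the resonant wavelength of each undulator group is the seed wavelength divided by the harmonic order. The sketch below assumes the 260 nm seed quoted above and the ordering described in this section (ninth harmonic first, seventh last). The function name is illustrative only.

```python
SEED_WAVELENGTH_NM = 260.0  # seed wavelength quoted in the text (nm)


def harmonic_wavelength_nm(order, seed_nm=SEED_WAVELENGTH_NM):
    """Resonant wavelength (nm) of an undulator tuned to a harmonic of the seed."""
    return seed_nm / order


if __name__ == "__main__":
    # Undulator groups in beam order: ninth, eighth, then seventh harmonic.
    for group, order in (("U1-U2", 9), ("U3-U4", 8), ("U5-U6", 7)):
        print(f"{group}: harmonic {order}, {harmonic_wavelength_nm(order):.3f} nm")
```

The three values reproduce the resonant wavelengths listed in Extended Data Table 4 (28.889 nm, 32.500 nm and 37.143 nm).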
Raw data were generated at the FERMI large-scale facility. Derived data 
supporting the findings of this study are available from the correspond- 
Extended Data Fig. 1 | Free-electron laser configuration for the generation 
of multiple harmonics and experimental setup. a, b, Configurations of the six 
undulators (U1–U6) for the generation of three (a) and four (b) harmonics. In the 
first case, two undulators per harmonic were used, while in the second case, 
each harmonic was generated by one undulator only. The phase-shifters 
(PS) used to control the relative phase between the harmonics are 
indicated in yellow for the two configurations. c, d, Typical single-shot 
photoelectron spectra without (black lines) and with (red lines) the NIR pulse, 
measured for the three (c) and four (d) harmonic cases. e, Schematic, half-section 
view of the spectrometer, including the ion flight tube (bottom) and 
electron flight tube (top). f, Normalized simulated geometrical collection 
efficiency as a function of polar emission angle for 2-42 eV electrons, using a 
cylindrical magnet configuration with the pole placed 5 mm away from the 
interaction region. Electrons were simulated using steps of 5 eV. An emission 
angle of 0° (180°) corresponds to the axis of the spectrometer in (away from) 
the direction of the electron detector. 
Extended Data Fig. 2 | Simulated correlation plots. Shown are simulated correlation plots (Ps, Ps) for different values of $\Delta\varphi_{7,8,9}$ from 0 to 2π in steps of π/4: 
$\Delta\varphi_{7,8,9}$ = 0 (a), π/4 (b), π/2 (c), 3π/4 (d), π (e), 5π/4 (f), 3π/2 (g), 7π/4 (h) and 2π (i). The intensities of the three harmonics are equal. 
Extended Data Fig. 3 | Simulated correlation parameter $\rho_{7,8,9}$. Evolution of the 
correlation parameter $\rho_{7,8,9}$ as a function of the phase difference $\Delta\varphi_{7,8,9}$ 
simulated using the SFA (red) and the TDSE (blue). The black curve indicates a 
cosine evolution. 
Extended Data Fig. 4 | Phase reordering of single-shot sideband intensities. 
a,b, Intensity of the sidebands S$} (a) and S (b) (black points) as a function of 
the relative phase 3@y,,T between the attosecond pulse train and the NIR field. 
The red curves show sinusoidal fits of the distributions. c, Comparison of the 
reconstructed attosecond pulse train using the correlation parameter method 
$\rho_{7,8,9}$ (black curve) and the ‘reconstruction of attosecond beating by 
interference of two-photon transitions’ method (red curve) based on the phase 
differences extracted from the sinusoidal fits. The second method is typically 
used for the characterization of attosecond pulse trains produced by HHG. The 
error in the reconstructions is indicated by the shaded areas. 
Extended Data Fig. 5 | Reconstruction of attosecond pulses for multi-NIR 
photon transitions. a–d, Input (black line) and reconstructed (a, red line; 
b, blue line; c, green line; and d, magenta line) intensity profiles of the 
attosecond train, corresponding to Fig. 1c-f for phase differences $\Delta\varphi_{7,8,9}$ = 0 (a), 
π/2 (b), π (c) and 3π/2 (d). e, Reconstruction of attosecond pulses from 
sideband oscillations for multi-NIR photon transitions for the trace presented 
in Fig. 1b (input (black line) and reconstructed (blue dotted line) intensity 
profiles). The intensity of the NIR pulse is /yjp=1.5 x10" W cm”. The relative 
phases between the harmonics are: @,) — @y=108°, @) - ~,=160° and gs - g,=8°. 
Extended Data Fig. 6 | GENESIS 1.3 simulations. Shown is the attosecond pulse train simulated using the GENESIS 1.3 code: a, complete temporal evolution of the 
train, and b, magnified view of the attosecond pulses in the train. 
Extended Data Table 1 | XUV experimental parameters 

Configuration     Harmonic   Photon energy (eV)   Energy (µJ)    Duration (fs)   Intensity (W/cm²) 
Three harmonics   7          32.88 ± 0.03         4.21 ± 0.58    50 ± 5          (1.1 ± 0.2) × 10¹⁴ 
Three harmonics   8          37.57 ± 0.04         5.29 ± 0.58    50 ± 5          (1.5 ± 0.2) × 10¹⁴ 
Three harmonics   9          42.27 ± 0.04         6.7 ± 1.2      50 ± 5          (1.9 ± 0.4) × 10¹⁴ 
Four harmonics    7          32.88 ± 0.03         1.00 ± 0.12    50 ± 5          (2.8 ± 0.4) × 10¹³ 
Four harmonics    8          37.57 ± 0.04         0.95 ± 0.12    50 ± 5          (2.6 ± 0.4) × 10¹³ 
Four harmonics    9          42.27 ± 0.04         0.67 ± 0.11    50 ± 5          (1.9 ± 0.4) × 10¹³ 
Four harmonics    10         46.96 ± 0.05         0.40 ± 0.50    50 ± 5          (1.1 ± 0.3) × 10¹³ 

Measured harmonic order, photon energy, energy of the single harmonic and intensity of the single harmonic for the three- and four-harmonic experiments. For the duration of the single 
harmonic, we used the values reported in ref. *. 


Extended Data Table 2 | Experimental correlation coefficients 

Panel                            a              b              c              d               e 
$\rho_{7,8,9}$                   0.83 ± 0.02    0.74 ± 0.01    0.32 ± 0.03    -0.20 ± 0.02    -0.62 ± 0.03 
$\Delta\varphi_{7,8,9}$ (rad)    9.94 ± 0.12    0.50 ± 0.05    1.21 ± 0.05    1.91 ± 0.03     2.62 ± 0.05 

Panel                            f              g              h              i               j 
$\rho_{7,8,9}$                   -0.75 ± 0.02   -0.46 ± 0.01   0.14 ± 0.03    0.67 ± 0.01     0.86 ± 0.01 
$\Delta\varphi_{7,8,9}$ (rad)    3.33 ± 0.20    4.04 ± 0.03    4.74 ± 0.03    5.45 ± 0.02     6.16 ± 0.02 

Correlation coefficients $\rho_{7,8,9}$ and phase differences $\Delta\varphi_{7,8,9}$ for the measurements presented in the panels of Fig. 2 in the main text. 
Extended Data Table 3 | Amplitudes and harmonic phase differences 

Configuration     Figure   $F_{10}$       $F_9$          $F_8$          $F_7$          $\Delta\varphi_{8,9,10}$ (rad)   $\Delta\varphi_{7,8,9}$ (rad) 
Three harmonics   3b,e     -              0.95 ± 0.06    1.00 ± 0.03    0.89 ± 0.06    -                                0.08 ± 0.08 
Three harmonics   3c,f     -              0.94 ± 0.07    1.00 ± 0.03    0.87 ± 0.06    -                                1.93 ± 0.03 
Three harmonics   3d,g     -              0.92 ± 0.07    1.00 ± 0.03    0.88 ± 0.06    -                                3.29 ± 0.24 
Three harmonics   3i,l     -              0.75 ± 0.06    1.00 ± 0.04    1.06 ± 0.06    -                                6.08 ± 0.07 
Three harmonics   3j,m     -              0.78 ± 0.08    1.02 ± 0.04    0.80 ± 0.06    -                                6.09 ± 0.40 
Three harmonics   3k,n     -              0.85 ± 0.09    0.83 ± 0.05    0.82 ± 0.07    -                                6.18 ± 0.12 
Four harmonics    4a-c     0.59 ± 0.07    0.76 ± 0.06    1.00 ± 0.05    1.03 ± 0.05    0.20 ± 0.15                      5.94 ± 0.04 
Four harmonics    4d-f     0.58 ± 0.07    0.77 ± 0.07    1.04 ± 0.05    1.07 ± 0.05    0.20 ± 0.15                      1.23 ± 0.06 
Four harmonics    4g-i     0.57 ± 0.08    0.77 ± 0.08    1.03 ± 0.06    0.99 ± 0.05    2.89 ± 0.08                      2.95 ± 0.09 

Amplitudes $F_{10}$, $F_9$, $F_8$ and $F_7$ and phase differences $\Delta\varphi_{8,9,10}$ and $\Delta\varphi_{7,8,9}$ for the three- (Fig. 3) and four-harmonic cases (Fig. 4). For the phase (amplitude) shaping in the three-harmonic case, the 
photoelectron spectra were rescaled to the area of the peak corresponding to the eighth harmonic in Fig. 3b and e (Fig. 3i and l). For the four-harmonic case, the photoelectron spectra were 
rescaled to the area of the eighth harmonic in Fig. 4a–c. For the pulse reconstruction, the measured $\Delta\varphi_{7,8,9}$ phases were corrected for the shift of 0.157 rad obtained by the TDSE simulations. 


Extended Data Table 4 | Parameters for the GENESIS 1.3 simulations 

Undulator parameters 
  U1–U2 resonant wavelength     28.889 nm 
  U3–U4 resonant wavelength     32.500 nm 
  U5–U6 resonant wavelength     37.143 nm 
  Undulator polarisation        linear 
  Phase shifters (PS)           0 rad 
  Dispersion                    60 µm 

Seed laser parameters 
  Wavelength                    260 nm 
  Power                         40 MW 
  Pulse length (FWHM)           110 fs 

Electron beam parameters 
  Energy                        1.2 GeV 
  Energy spread                 110 keV 
  Normalized emittance          1 mm rad 
  Peak current                  700 A 
  Beam size                     50–70 µm 

Parameters used in the GENESIS 1.3 code for simulating the generation of the train of 
attosecond pulses. 
Extensive efforts have been made to harvest energy from water in the form of 
raindrops, river and ocean waves, tides and others. However, achieving a high 
density of electrical power generation is challenging. Traditional hydraulic power 
generation mainly uses electromagnetic generators that are heavy, bulky, and 
become inefficient with low water supply. An alternative, the water-droplet/solid- 
based triboelectric nanogenerator, has so far generated peak power densities of less 
than one watt per square metre, owing to the limitations imposed by interfacial 
effects—as seen in characterizations of the charge generation and transfer that occur 
at solid-liquid or liquid-liquid interfaces. Here we develop a device to harvest 
energy from impinging water droplets by using an architecture that comprises a 
polytetrafluoroethylene film on an indium tin oxide substrate plus an aluminium 
electrode. We show that spreading of an impinged water droplet on the device bridges 
the originally disconnected components into a closed-loop electrical system, 
transforming the conventional interfacial effect into a bulk effect, and so enhancing 
the instantaneous power density by several orders of magnitude over equivalent 
devices that are limited by interfacial effects. 


Our droplet-based electricity generator (DEG) is based on our recent 
work showing that the continuous impinging of water droplets on a 
fluorinated material induces a high charge density on its surface. Our 
DEG device (Fig. 1a) is fabricated using drop-casting of polytetrafluoroethylene 
(PTFE), deposited with a tiny piece of aluminium, onto a glass 
substrate coated with indium tin oxide (ITO). As shown in Fig. 1b and 
Extended Data Fig. 1, the as-fabricated device is optically transparent, 
smooth and slippery. We hypothesized that, with continuous droplet 
impinging, the PTFE (a promising electret material with high charge-storage 
capability and stability) could serve as an ideal reservoir for 
charge storage, while electrostatically inducing opposite charge of the 
same amount on the ITO for possible charge transfer to an aluminium 
electrode. We find that when a falling water droplet spreads on the 
PTFE surface, it bridges the originally disconnected components 
(the PTFE/ITO and aluminium electrode) into a closed-loop electrical 
system. 

Figure 1c shows the time-dependent variation of measured surface 
charges on the PTFE film of an as-fabricated device under a relative 
humidity of 65.0%. With an increase in the number of impinging tap-water 
droplets (ion concentration 3.1 mM), there is a gradual increase 
in the amount of surface charge. Eventually, after around 1.6 × 10⁴ 
droplets, the surface charge reaches a stable value of about 49.8 nC, 
indicating that continuous droplet impinging can serve as a robust way 
to maintain stable and sufficient surface charge on the PTFE surface 
(Extended Data Fig. 2a). 

We measured the electricity generation of an individual impinging 
droplet on the as-fabricated device in which the PTFE surface had 
been loaded with sufficient and stable charge as a result of contact 
electrification between liquid and solid after continuous droplet 
impinging up to around 1.6 × 10⁴ times. As shown in Fig. 1d and 
Supplementary Video 1, 400 commercial light-emitting diodes (LEDs) could 
be powered to instantaneously light up when four droplets of 100.0 µl 
each, released from a height of 15.0 cm, contact the device. Focusing 
on an individual DEG indicates that the open-circuit output voltage 
and short-circuit current were about 143.5 V (Fig. 1e) and 270.0 µA 
(Fig. 1f), respectively, around 295.0 and 2,600.0 times higher than 
the values obtained without an aluminium electrode (Extended Data 
Figs. 2b, 3). The instantaneous peak power density is 50.1 W m⁻² under 
a load resistance of 332.0 kΩ (Extended Data Fig. 2c), which is three 
orders of magnitude higher than that of the control device without 
an aluminium electrode. We calculate the average energy-conversion 
efficiency of our DEG, defined as the harvested electrical energy relative 
to the input energy of an impinging droplet, to be roughly 2.2%, 
which is several orders of magnitude higher than that of our control 
device without an aluminium electrode. Note that the instantaneous 
peak power density can be enhanced further by increasing the surface 
Fig. 1 | Design of the DEG. a, Schematic diagram. b, Optical image showing four 
parallel DEG devices fabricated on a glass substrate. The volume of each 
droplet is 100.0 µl. c, As individual droplets continue to impinge on the 
as-fabricated device, the amount of charge on the PTFE surface increases 
gradually and eventually reaches a stable value. d, One hundred commercial 
LEDs can be powered when one droplet, released from a height of 15.0 cm, is in 
contact with the device. e, Under the same experimental conditions (for 
example, the same droplet size and height of release), the output voltage 
measured from the DEG (in red, with the frequency of impinging droplets being 


charge of the PTFE film using an ion-injection method (Extended 
Data Fig. 4a, b). However, the long-term operation of the DEG device 
precharged by ion injection is susceptible to a gradual degradation of 
surface charge, eventually exhibiting performance comparable to that 
obtained through continuous droplet impact (Extended Data Fig. 4c). 
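
To give a sense of scale for the roughly 2.2% conversion efficiency quoted earlier, the following sketch estimates the input energy of a single droplet and the electrical energy that this efficiency would imply. It assumes that the input energy is the gravitational potential energy of a 100.0 µl water droplet released from 15.0 cm (air drag neglected) and that water has a density of 1,000 kg m⁻³; how the input energy is evaluated is not stated in this passage, so these are illustrative assumptions, as are the function names.

```python
G = 9.81                 # gravitational acceleration (m/s^2)
WATER_DENSITY = 1000.0   # assumed density of water (kg/m^3)


def droplet_input_energy_j(volume_ul, height_cm, density=WATER_DENSITY):
    """Potential energy m*g*h (J) of a droplet; 1 µl equals 1e-9 m^3."""
    mass_kg = density * volume_ul * 1e-9
    return mass_kg * G * height_cm * 1e-2


def harvested_energy_j(input_energy_j, efficiency):
    """Electrical energy (J) implied by a given conversion efficiency."""
    return input_energy_j * efficiency


if __name__ == "__main__":
    e_in = droplet_input_energy_j(100.0, 15.0)
    e_out = harvested_energy_j(e_in, 0.022)
    print(f"input energy per droplet: {e_in * 1e6:.1f} µJ")
    print(f"harvested at 2.2%:        {e_out * 1e6:.2f} µJ")
```

Under these assumptions each droplet carries about 147 µJ, so a 2.2% efficiency corresponds to a few microjoules of electrical energy per impact.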

The boost in the output performance of our device compared with 
the conventional design suggests that the DEG might operate via a 
different mechanism. First, as shown in Fig. 2a, the essential charges 
carried by the droplet before and after its impact on the precharged 
DEG are negligible. Moreover, there is no notable difference in the 
charge generated by droplets dispensed from a grounded versus an 
ungrounded outlet. Having ruled out any effects of droplets themselves 
and of the dropper on electricity generation, we next analysed 
the time-dependent evolution of the output current (Fig. 2b, c). Initially, 
upon contact with the PTFE surface, there is no apparent output current 
from the spreading droplet: the device is in a ‘switched-off’ state. This 
is essentially what occurs in the conventional design, with the charge 
generation being limited by an interfacial effect. The current then exhibits 
a sharp rise with a pronounced peak of up to 213.7 µA at an 
on-time ($t_{\mathrm{on}}$) of 0 ms, transitioning into a switched-on state. Careful 
inspection shows that the sharp increase in the current originates from 
the contact of the spreading droplet with the aluminium electrode. We 
propose that this is a result of directional and rapid transfer of charge 
from the ITO electrode to the aluminium electrode. As plotted in Fig. 2d, 
in the early stage of droplet spreading, there is a rapid increase in the 
measured charges, a pattern consistent with the observed current-time 
curve. As the droplet continues to spread, charge transfer between 


set at 4.2 Hz, and the total number of droplets being about 84) is more than two 
orders of magnitude higher than that from the control device (in black, with a 
droplet frequency of 1.0 Hz, and a total of 20 impinging droplets). The 
negligible electricity generation from the control device is limited by the 
interfacial effect, although its PTFE surface is loaded with the same amount of 
charge as the DEG. f, Comparison of output current from the DEG (in red) and 
the control device (in black) in response to continuous impinging of individual 
droplets. 


the ITO and aluminium electrodes continues until the droplet reaches 
its maximum spreading area, $A_{\max}$, of 2.7 cm², which is associated with a 
maximum charge, $Q_{\max}$, of 49.8 nC (Fig. 2d). With retraction and sliding 
of the droplet from the impacting centre, the positive current turns 
negative, indicating a back flow of charge from the aluminium electrode 
to the ITO. At an off-time ($t_{\mathrm{off}}$) of 16.0 ms, the water droplet can be fully 
detached from the slippery aluminium electrode; this is accompanied 
by the output current and charge dropping to zero (Supplementary 
Video 2). In this condition, all charge is restored to the ITO, and a new 
cycle starts. This reversibility is confirmed by the measurement of 
cyclic charge. As shown in Fig. 2d, the amount of charge transferred 
between the ITO and aluminium electrodes in each cycle is constant, an 
indication that there is no deterioration of surface charge on the PTFE 
film. This is also suggested by the long-term measurement of charge 
stability (Fig. 1c). 

To further understand the mechanisms underlying the performance 
of our DEG, we next examined the variation in the measured $Q_{\max}$ 
transferred from the ITO to the aluminium electrode as a function of 
the Weber number (Extended Data Fig. 5). This number is defined as 
$We = \rho D v^{2}/\gamma$, where $\rho$, $D$, $v$ and $\gamma$ are respectively the density, diameter, 
impact velocity and surface tension of the droplet. As shown in Fig. 2e, 
with an increase in the Weber number from 7.7 to 150.4, the transferred 
charge increases from 13.3 nC to 53.3 nC. Thus, the increase in the 
amount of transferred charge between the ITO and aluminium electrodes 
in response to a varying Weber number suggests that the electricity 
generation is exquisitely regulated by the interaction between 
the impinging droplet and the configuration of the DEG, rather than 
Fig. 2 | Origin of boosted electricity generation. a, The charges carried by the 
droplet ($Q_{\mathrm{droplet}}$) before and after impinging on the PTFE surface are negligible 
compared with the measured charge of the DEG ($Q_{\max}$). Data are means ± the 
standard error of the mean (s.e.m.). For each mean, the total number of 
measurements is around ten. b, Time-resolved variation in current generated 
from the DEG during the entire droplet impact process. The dashed lines 
delineate the specific part of the current waveform shown in c. c, 
Synchronization of droplet-spreading dynamics and current response, and 
mapping of the time-dependent variation in charge flowing between PTFE/ITO 
and the aluminium electrode. The droplet retracts but still maintains contact 
with the aluminium electrode, while the current reverses to a negative value. 


originating from just interfacial contact electrification. Moreover, the 
output is insensitive to the size and spatial location of the aluminium 
electrode and to the electrode material (Fig. 2f and Extended Data 
Fig. 6). 
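
The Weber number used above, $We = \rho D v^{2}/\gamma$, is straightforward to evaluate. The sketch below uses textbook values for water (density 1,000 kg m⁻³, surface tension 0.072 N m⁻¹) and treats the droplet as a sphere of a given volume; these values, the spherical approximation and the function names are assumptions for illustration and are not taken from the measurements reported here.

```python
import math

WATER_DENSITY = 1000.0           # assumed density of water (kg/m^3)
WATER_SURFACE_TENSION = 0.072    # assumed surface tension of water (N/m)


def weber_number(diameter_m, velocity_m_s,
                 density=WATER_DENSITY, surface_tension=WATER_SURFACE_TENSION):
    """We = rho * D * v^2 / gamma for a droplet of diameter D and impact speed v."""
    return density * diameter_m * velocity_m_s ** 2 / surface_tension


def sphere_diameter_m(volume_ul):
    """Diameter (m) of a sphere with the given volume in microlitres."""
    volume_m3 = volume_ul * 1e-9
    return (6.0 * volume_m3 / math.pi) ** (1.0 / 3.0)


def impact_velocity_for_weber(we, diameter_m,
                              density=WATER_DENSITY,
                              surface_tension=WATER_SURFACE_TENSION):
    """Impact speed (m/s) that yields a given Weber number."""
    return math.sqrt(we * surface_tension / (density * diameter_m))


if __name__ == "__main__":
    d = sphere_diameter_m(100.0)   # 100 µl droplet, as used in the text
    for we in (7.7, 150.4):        # range of Weber numbers quoted in the text
        v = impact_velocity_for_weber(we, d)
        print(f"We = {we}: D = {d * 1e3:.2f} mm, v = {v:.2f} m/s")
```

Because the Weber number scales with the square of the impact velocity, varying the release height is a convenient way to span the quoted range for a fixed droplet size.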




Fig. 3 | Circuit model. a, In the switched-off mode, there is no capacitor formed 
at the water/aluminium interface. As a result, $C_{\mathrm{P}}$ and $C_1$ remain in an open circuit 
and there is no charge flow between them. b, When the aluminium electrode 
and PTFE are connected by the water droplet (switched-on mode), another 
capacitor, $C_2$, is established at the water/aluminium interface, forming a closed 
Insets are snapshots showing droplet dynamics. d, Time-dependent variation 
in the transferred charge, Q, generated on the DEG by an impinging droplet, 
indicating that the charge can return to zero when the DEG moves to switched-off 
mode. e, Variation in the maximum charge, $Q_{\max}$, generated by an impacting 
droplet on the DEG under different Weber numbers. Data are means ± s.e.m. For 
each mean, the total number of measurements is around ten. f, Output voltages 
remain constant when the aluminium electrode is replaced by a gold or silver 
electrode, suggesting that electricity generation is not sensitive to the specific 
electrode material. For all specific electrodes, the frequencies of impinging 
droplets and the total numbers of droplets are 4.2 Hz and about 28, 
respectively. 


Looking at the device from a circuit perspective, the spreading droplet 
can be treated as a resistor and the PTFE as a capacitor, $C_{\mathrm{P}}$, in which 
the water/PTFE serves as the top plate and PTFE/ITO as the bottom 
plate. In the switched-off mode, no capacitor is formed at the water/ 


[Fig. 3c plot: horizontal axis, thickness of PTFE film (µm), 6 to 18] 


circuit. $R_{\mathrm{W}}$, $R_{\mathrm{L}}$ and dq(t)/dt in the circuit are, respectively, the impedance of the 
water droplet, the impedance of the external load and the derivative of the 
transferred charge with respect to time. c, Output voltage and maximum 
charge ($Q_{\max}$) as a function of PTFE thickness. Data are means ± s.e.m. For each 
mean, the total number of measurements is ten. 
Fig. 4 | Stability and generality. a, Effect of relative humidity on charge loading 
stability. b, Optical image of our system for collecting and dispensing rainwater 
droplets. c, Tailoring the sizes of dispensing droplets for enhanced output. 


aluminium interface, and the circuit maintains an open state (Fig. 3a). 
By contrast, in the switched-on mode, capacitors are formed at the 
water/PTFE interface and the water/aluminium interface, transforming 
the original open circuit into a closed circuit (Fig. 3b). Given that the 
thickness of the PTFE is several orders of magnitude larger than that of 
the electric double layer at the water/solid interfaces, the capacitance 
of the capacitor $C_{\mathrm{P}}$ is negligible compared with that of the capacitors 
formed at the water/PTFE interface ($C_1$) and the water/aluminium interface 
($C_2$). In combination with the high-density surface charge stored in 
the PTFE, the voltage across $C_{\mathrm{P}}$ is dramatically higher than that across 
$C_1$ and $C_2$. Thus, the instantaneous peak output voltage, $V$, occurs when 
the spreading droplet is in contact with the aluminium electrode, and 
can be approximated as $V \approx dQ_{\max}/(\varepsilon_{\mathrm{p}} A_{\max})$, where $d$ and $\varepsilon_{\mathrm{p}}$ are respectively 
the thickness and dielectric constant of the PTFE film. Given a measured 
$A_{\max}$ of 2.7 cm² and a $Q_{\max}$ of 49.8 nC, we calculate the voltage established 
across the PTFE to be roughly 143.5 V, consistent with our experimental 
measurement. Moreover, the measured peak voltage increases linearly 
with the thickness of the PTFE film, consistent with the predictions of 
our circuit model (Fig. 3c). Upon completion of the charging process, 
the $C_{\mathrm{P}}$ capacitor is recharged by the other two capacitors, as observed 
in current and charge measurements. 
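
The parallel-plate estimate $V \approx dQ_{\max}/(\varepsilon_{\mathrm{p}} A_{\max})$ can be checked numerically. The sketch below interprets $\varepsilon_{\mathrm{p}}$ as the absolute permittivity $\varepsilon_0\varepsilon_{\mathrm{r}}$ and takes a relative permittivity of 2.1 for PTFE, a common textbook value; both choices are assumptions, since the film thickness and permittivity are not given in this passage. It then asks which film thickness would reproduce the quoted 143.5 V. The function names are illustrative.

```python
EPS0 = 8.8541878128e-12   # vacuum permittivity (F/m)
PTFE_REL_PERMITTIVITY = 2.1  # assumed textbook value for PTFE


def peak_voltage_v(q_max_c, area_m2, thickness_m,
                   rel_permittivity=PTFE_REL_PERMITTIVITY):
    """Parallel-plate estimate V = d * Q / (eps0 * eps_r * A)."""
    return thickness_m * q_max_c / (EPS0 * rel_permittivity * area_m2)


def implied_thickness_m(voltage_v, q_max_c, area_m2,
                        rel_permittivity=PTFE_REL_PERMITTIVITY):
    """Film thickness (m) that reproduces a given voltage in the same model."""
    return voltage_v * EPS0 * rel_permittivity * area_m2 / q_max_c


if __name__ == "__main__":
    q_max = 49.8e-9    # maximum transferred charge (C), from the text
    a_max = 2.7e-4     # maximum spreading area (m^2), i.e. 2.7 cm^2
    d = implied_thickness_m(143.5, q_max, a_max)
    print(f"implied PTFE thickness: {d * 1e6:.1f} µm")
    for thickness_um in (6, 10, 14, 18):   # range shown on the Fig. 3c axis
        v = peak_voltage_v(q_max, a_max, thickness_um * 1e-6)
        print(f"d = {thickness_um} µm -> V = {v:.0f} V")
```

Under these assumptions the quoted voltage corresponds to a film thickness of roughly 14 to 15 µm, which lies inside the thickness range explored in Fig. 3c, and the linear dependence of the voltage on thickness follows directly from the formula.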

To gain molecular-level insights into the process of charge transfer 
between the PTFE and the aluminium electrode when a water droplet is 
in contact with both (that is, in the switched-on mode), we carried out 
molecular-dynamics simulations, using a nanoscale system with a water 
slab containing positive and negative ions. The simulations predict a 
Data are means ± s.e.m. For each mean, the total number of measurements is 
around ten. d, Harvesting of hydrodynamic energy from different water 
sources: tap water, rainwater and sea water. 


concerted motion and rapid separation of mobile positive and negative 
ions with the presence of an internal electric field between the PTFE 
and aluminium, once the fixed charges on the PTFE and aluminium 
are switched on (Extended Data Fig. 7). The separation of the mobile 
positive and negative charges in the water droplet towards the water/ 
PTFE interface and the water/aluminium interface, respectively, reveals 
the charging process of the two capacitors at the water/PTFE interface 
($C_1$) and water/aluminium interface ($C_2$) at the molecular level. 

We also examined the stability of our devices under harsh environments
involving high relative humidity. At a relative humidity of 100%,
the surface charge increases gradually as the number of impinging
droplets increases, reaching a saturated value of 28.1 nC (Fig. 4a). As
the relative humidity is reduced to 70%, the surface charge increases
rapidly and levels off at a stable value of 44.0 nC. On further exposure to
100% relative humidity, the charge output returns to 28.1 nC, indicating
that continuous droplet impinging can help to maintain a constant and
steady output even in harsh environments. This enhanced charge stability
can be ascribed to the combination of continuous droplet impinging
and the good charge-carrying capability of PTFE. By contrast, for
control devices made of porous PTFE, polydimethylsiloxane (PDMS)
or polypropylene, under a relative humidity of 65.0%, performance
decays owing to the poor charge stability of the surfaces (Extended
Data Fig. 8).

In addition to tap water, our DEG can harvest hydrodynamic energy 
from both raindrops and sea water. For raindrops, we designed 
a home-made platform consisting of a droplet collector and a
capillary-tubing-based dispenser (Fig. 4b and Supplementary Video 3).
By adjusting the diameter of the capillary tubing and the height of 
release, we can precisely control the size and velocity of raindrops 
that contact the DEG arrays for enhanced on-demand output (Fig. 4c). 
Similarly, such a platform can separate a continuous flow of sea water
into discontinuous droplet arrays, allowing for efficient electricity 
generation from a wide range of water-energy sources. We note that 
energy conversion from seawater droplets is lower than that from tap 
water and raindrops; however, it is still much higher than that of the 
conventional tap-water-based approach (Fig. 4d).


Methods


Materials 

Acetone (RCI Labscan, 99.5%), ethanol (Sigma Aldrich, 97%), nitric 
acid (Sigma Aldrich, 70%), porous PTFE film (Sterlitech Corporation, 
PTU023001), PTFE precursor (Dupont AF 601S2, 6 wt%) and PDMS 
(Dow Corning Sylgard 184) were used without further purification. 
The PTFE precursor is composed of PTFE dissolved in 4,5-difluoro- 
2,2-bis(trifluoromethyl)-1,3-dioxole (a low-boiling organic solvent), 
which does not contain any extra additives. 


Fabrication of DEG device 

To fabricate the DEG, we first ultrasonically cleaned a piece of ITO glass
slide, of size 30 mm × 30 mm × 0.4 mm, in acetone and then ethanol for
10 min each. We then deposited the PTFE precursor on the ITO glass by
drop-casting, and heated it at 120 °C for 15 min to remove all solvent
in the PTFE precursor. Upon curing at 120 °C, the PTFE precursor was
transformed into a smooth and dense PTFE film, as shown by scanning
electron microscopy (SEM; Extended Data Fig. 1). The thickness of
the PTFE film can be adjusted by controlling the volume of precursor
(Extended Data Fig. 9). To construct the aluminium electrode, we assembled
a tiny conductive aluminium tape of size 1 mm × 5 mm × 50 µm
onto the as-prepared PTFE film. For comparison, we also fabricated a
control device in the same way but without the aluminium electrode. To
fabricate a control device with PDMS film as the dielectric layer, we
spin-coated a liquid mix of polydimethylsiloxane and a curing agent (ratio
10/1) with a volume of 200 µl onto ITO glass at a speed of 3,000 revolutions
per minute, and then cured the film at 80 °C for 1 h. To fabricate
control devices with porous PTFE film and commercial polypropylene
tape as the dielectric layer, we attached the porous PTFE film and
commercial polypropylene tape directly onto the ITO glass slide.


Characterization and electrical measurement 

We used a syringe pump and a plastic tube to generate water droplets.
The droplet size could be tailored by varying the inner diameter of the
plastic nozzle connecting to the outlet of the plastic tube. The inner
diameter of the nozzle required to generate droplets of 100 µl was
6.0 mm. If not specified, the composition of water droplets was tap
water at an ion concentration of 3.1 mM. The volume of water droplets
was fixed at 100 µl and the droplet outlet was not earthed. We recorded
the spreading and retraction dynamics of water droplets using a
high-speed camera (Photron FASTCAM SA4) at a typical recording speed of
6,000 frames per second. The voltage output of DEG was measured
using an oscilloscope (Rohde & Schwarz, RTE1024) equipped
with a high-impedance (10 MΩ) probe. We measured the current and
the charges transferred between the ITO and aluminium electrode
using the oscilloscope coupled with a low-noise current preamplifier
(Stanford Research Systems model SR570) and a Faraday cup connected
with a nanocoulomb meter (Monroe model 284), respectively. The
as-fabricated device was tilted at 45.0° for rapid liquid detachment.
To measure the variation in maximum charges, Q_max, transferred from
ITO to aluminium as a function of the Weber number or the maximum
spreading area, we varied the releasing heights of droplets between
1 cm and 20 cm. In typical measurements, we kept the relative humidity
and the environmental temperature at approximately 65.0% and
20.0 °C, respectively.
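
The link between release height and Weber number can be illustrated with the standard definition We = ρv²D/γ. This is a rough sketch under stated assumptions, none of which come from the source: the droplet falls from rest without air drag (v² = 2gh), it is treated as a sphere of the stated 100 µl volume, and the water density (1,000 kg m⁻³) and surface tension (0.072 N m⁻¹) are textbook room-temperature values.

```python
import math

RHO = 1000.0    # ASSUMPTION: density of water, kg/m^3
GAMMA = 0.072   # ASSUMPTION: surface tension of water, N/m
G = 9.8         # gravitational acceleration, m/s^2


def droplet_diameter(volume_m3):
    """Diameter of a sphere with the given volume."""
    return (6.0 * volume_m3 / math.pi) ** (1.0 / 3.0)


def weber_number(height_m, volume_m3):
    """We = rho * v^2 * D / gamma, with v^2 = 2*g*h for drag-free fall from rest."""
    v_squared = 2.0 * G * height_m
    return RHO * v_squared * droplet_diameter(volume_m3) / GAMMA


volume = 100e-9  # 100 µl expressed in m^3
for h_cm in (1, 5, 10, 15, 20):
    print(f"h = {h_cm:2d} cm -> We ~ {weber_number(h_cm / 100.0, volume):.0f}")
```

Within this idealisation the Weber number scales linearly with release height, so sweeping the height from 1 cm to 20 cm spans a factor of 20 in We.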


Continuous droplet impinging and electricity generation 

We showed in the main text that the DEG made of PTFE loaded with sufficient
charge allows for reversible and efficient electricity generation.
Here we demonstrate that sufficient charge on PTFE can be achieved by
continuous droplet impinging. Extended Data Fig. 2a shows the variation
in output voltage measured from an individual impinging droplet
as a function of the number of droplets impinging. Q_max and the output
voltage increase gradually with increasing droplet impinging times,
eventually reaching a plateau with the charge and voltage stabilized
at 49.8 nC and 143.5 V, respectively, after impinging of 1.6 × 10⁴ times
(Fig. 1c). This charge-loading method is applicable to a wide range
of thicknesses of the PTFE film. Our result shows that the maximum
transferred charges are comparable after 1.6 × 10⁴ times of droplet
impinging when the thickness of the PTFE film varies from 6.7 µm to
16.9 µm (Fig. 3c). Note that the voltage increases linearly with film
thickness, agreeing with the predictions of our circuit model.


Comparison with a conventional generator 

We also characterized the performance of a control sample that lacks
the aluminium electrode. Note that the PTFE surface of this control was
prepared using the same method as for DEG. Extended Data Fig. 3a,
b shows an optical image of the as-fabricated droplet-based control
device and a schematic drawing of its basic working mechanism.
Before the droplet contacts the PTFE, the amount of (positive) charges
on the ITO is the same as the (negative) charges on the PTFE because of
electrostatic induction. Thus, there is no current flow from ITO to the
ground (Extended Data Fig. 3b, i). When a water droplet leaves the PTFE
surface after impacting (Extended Data Fig. 3b, ii), the droplet becomes
positively charged while the PTFE is more negatively charged as a result
of contact electrification (Extended Data Fig. 3b, iii). Accordingly,
a flow of current between the ground and ITO electrode is induced
(Extended Data Fig. 3b, iv). As shown in Extended Data Fig. 3c, d, the
voltage and transferred charge generated from nine droplets impinging
on the control device are measured when the frequency of impinging
droplets is set at 1.0 Hz. For a single droplet, the output voltage and
the amount of transferred charge are roughly −0.4 V and 0.075 nC,
respectively, both of which are negligible compared with those of
the DEG. Moreover, for the nine droplets, the total amount of transferred
charge in the control device is measured to be identical with the
accumulated charge carried by the departing droplet (Extended Data
Fig. 3d), confirming that electricity generation in the control device
indeed originates from the triboelectric effect, a natural interfacial
phenomenon. Note that by using continuous droplet impinging or the
ion-injection method, the amount of negative charge on the control
PTFE surface can be enhanced, which can then induce positive charges
on ITO. However, these positive charges cannot be released from the
ITO because of attraction by negative charges on the PTFE, and there
is no pronounced electricity generation, in striking contrast to the
DEG. All of these results highlight the unique advantage of the DEG,
which is characterized by a bulk effect and hence enhanced electricity
generation.


Calculation of average conversion efficiency 

To quantify the performance of our DEG, we calculated the average
conversion efficiency of mechanical energy into electric energy. We first
calculated the instantaneous conversion efficiency of the ith droplet
impinging, η_i, as follows:

$$\eta_i = \frac{E_{\mathrm{out}}}{E_{\mathrm{in}}} = \frac{\int \frac{U^{2}}{R_L}\,\mathrm{d}t}{mgh} \qquad (1)$$

where the droplet mass, m, is 0.1 g; the gravitational acceleration, g,
is 9.8 m s⁻²; the relative height between the releasing droplet and the
DEG, h, is 15.0 cm; and R_L = 10.0 MΩ. For our DEG, the kinetic energy
carried by a droplet (of 100 µl) released from a height of 15.0 cm at the
1.6 × 10⁴th droplet impinging (that is, i = 1.6 × 10⁴) is roughly 1.47 × 10⁻⁴ J,
and the generated electrical energy is calculated as 3.2 µJ (Extended
Data Fig. 4b), corresponding to an η_i of around 2.2%. Such efficiency is five
orders of magnitude higher than that of our control device (around
2.1 × 10⁻⁵%).
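
The single-droplet figures can be reproduced with a few lines of arithmetic. The sketch below is illustrative rather than the authors' analysis code: `input_energy` evaluates mgh, and `output_energy` integrates U²/R_L over a uniformly sampled voltage trace with the trapezoidal rule. The function names and the uniform-sampling assumption are ours; the 3.2 µJ output energy is taken directly from the text instead of being recomputed from raw data.

```python
def input_energy(mass_kg, height_m, g=9.8):
    """Mechanical energy m*g*h delivered by a droplet released from height h."""
    return mass_kg * g * height_m


def output_energy(voltages, dt, r_load):
    """Trapezoidal integral of U^2 / R_L over a uniformly sampled voltage trace."""
    power = [u * u / r_load for u in voltages]
    return dt * (sum(power) - 0.5 * (power[0] + power[-1]))


e_in = input_energy(mass_kg=1e-4, height_m=0.15)  # 0.1 g released from 15.0 cm
e_out = 3.2e-6                                     # J, value quoted in the text
eta = e_out / e_in

print(f"E_in = {e_in:.3e} J")
print(f"eta_i = {100 * eta:.2f} %")
```

With a measured oscilloscope trace, `output_energy(trace, dt, 10.0e6)` would supply `e_out` for the 10.0 MΩ load used here.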

Next, we discuss the overall conversion efficiency (η_n) of the first n
droplets, including those droplets used in precharging; this conversion
efficiency is calculated as the average of the instantaneous conversion
efficiency of all droplets (i = 1, 2, 3, ..., n), expressed as:

$$\eta_n = \frac{1}{n}\sum_{i=1}^{n}\eta_i$$

The η_i of individual droplets can be obtained as above (equation (1)). The
overall efficiency will increase with the number of impacting droplets. 
For example, the overall efficiency approaches 2.2%—a conversion 
efficiency obtained for a single falling droplet impacting on the DEG 
in the steady state—when the number of impinging droplets is more 
than 2.0 x 10°. 
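
To see why the average efficiency climbs towards the steady-state value, consider the following sketch. The steady-state efficiency (2.2%) and the number of droplets needed to reach the charge plateau (1.6 × 10⁴) come from the text; the linear ramp used for η_i during precharging is a purely hypothetical shape chosen for illustration, and the droplet counts in the loop are likewise illustrative.

```python
ETA_STEADY = 0.022   # steady-state single-droplet efficiency (from the text)
N_CHARGE = 16_000    # droplets needed to reach the charge plateau (from the text)


def eta_i(i):
    """HYPOTHETICAL ramp: linear growth up to the plateau, constant afterwards."""
    return ETA_STEADY * min(i / N_CHARGE, 1.0)


def overall_efficiency(n):
    """Average of the instantaneous efficiencies of droplets i = 1..n."""
    return sum(eta_i(i) for i in range(1, n + 1)) / n


for n in (16_000, 50_000, 200_000):
    print(f"n = {n:7d}: overall efficiency = {100 * overall_efficiency(n):.2f} %")
```

Whatever the true precharging curve looks like, the early low-efficiency droplets carry less and less weight as n grows, so the average converges to the steady-state value.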


Maximum spreading area 

We also studied the dependence of the maximum spreading area, A_max,
on the surface charge under a fixed impacting Weber number of 100.
For a PTFE film without loaded charges, A_max is measured at 2.71 cm²,
which is comparable with the A_max value measured for PTFE with loaded
charges (2.72 cm²) (Extended Data Fig. 5). This result suggests that
the maximum spreading area of a droplet on the PTFE surface relies
on the predetermined release height of the impinging droplet, and is
insensitive to the surface charge of the PTFE film.


Circuit analysis 

We now discuss the entire droplet and device from a circuit perspective.
When a droplet spreads on a PTFE surface loaded with sufficient
negative charge, the base contact area with the PTFE varies dynamically
as a function of time. A capacitor, C_P, is formed with the water/PTFE as
the top plate and PTFE/ITO as the bottom plate, respectively. At the
water/PTFE interface, there is an additional capacitor, C_1. Before the
impinging droplet contacts the aluminium electrode, there is no
capacitor formed at the water/aluminium interface. As a result, C_P and
C_1 remain in an open circuit and there is no charge flow between them
(Fig. 3a). By contrast, when the aluminium electrode and PTFE are connected
by the liquid (switched-on mode), the other capacitor, C_2, is
established at the water/aluminium interface. Thus, C_2, together
with C_P and C_1, forms a closed circuit. The instantaneous peak output
voltage, V, occurs when the spreading droplet is in contact with the
aluminium, and can be approximated as Q_max d/(ε_P A_max). In this circuit,
the time-dependent capacitance of C_P, C_1 and C_2 can be expressed
as C_P(t) = A(t)ε_P/d, C_1(t) = A(t)ε_w/λ_EDL and C_2(t) = A_2(t)ε_w/λ_EDL,
respectively, where A(t) and A_2(t) are the time-dependent contact areas of the
water/PTFE interface and the water/aluminium interface, respectively;
and ε_w and λ_EDL are the dielectric constant of PTFE and the width of EDL,
respectively. The equivalent circuit is shown in Fig. 3b, governed by
the following differential equation:


$$(R_L + R_W)\,\frac{\mathrm{d}q(t)}{\mathrm{d}t} = \frac{Q(t) - q(t)}{C_P(t)} - \frac{q(t)}{C_1(t)} - \frac{q(t)}{C_2(t)}, \qquad q(t=0) = 0$$


where q(t) is the transfer charge, and R_L and R_W are the impedance of
the external load and water droplet, respectively.


Molecular-dynamics simulations 

To simulate ion transport and separation in a water droplet in contact
with the as-fabricated device, we carried out molecular-dynamics
simulations. To this end, we used the transferable intermolecular potential
with four points for the simulation of water and solid ice (TIP4P/ICE) water
model, which is popular in molecular-dynamics simulations of water.
Various properties of water (including static properties such as the
melting point of ice, liquid density in ambient conditions, and water/ice
phase diagram, as well as kinetic properties such as water's diffusion
constant in ambient conditions) have been successfully reproduced
using TIP4P/ICE. To mimic ion conduction in tap water, we introduce
identical amounts of sodium (Na⁺) and chloride (Cl⁻) ions into the water.
The molecular-dynamics system includes a slab of water containing
4.0 × 10⁴ water molecules with 808 Na⁺ and 808 Cl⁻ ions. To mimic the
charged PTFE and aluminium electrode, we use rigid atomic trilayers.
We fix 800 negative and 800 positive charges, with a spacing of 8.7 Å, on
the middle and bottom atomic layers of PTFE (for the negative charges)
and the ITO electrode (for the positive charges); each site is charged +e
or −e. The box size of the model system is 17.3 nm × 17.3 nm × 31.4 nm, in
which the thickness of water layer is about 4.5 nm. Periodic boundary
conditions are applied in the x and y directions. The parameters for Na⁺
and Cl⁻ are taken from previous work (σ_Na = 2.876 Å; ε_Na = 0.5216 kJ mol⁻¹;
σ_Cl = 3.785 Å; ε_Cl = 0.5216 kJ mol⁻¹), where ε is the depth of a potential well,
and σ is the finite distance at which the interparticle potential is zero.
The cross Lennard-Jones interaction parameters between water and
Na⁺ and Cl⁻ ions are given by the Lorentz-Berthelot rule. The interactions
between substrate atoms (sub) and the NaCl water solution
are described by the 12-6 Lennard-Jones potential (σ_Na-sub = 3.021 Å;
ε_Na-sub = 0.4785 kJ mol⁻¹; σ_Cl-sub = 3.476 Å; ε_Cl-sub = 0.4785 kJ mol⁻¹;
σ_O-sub = 3.458 Å; ε_O-sub = 0.6223 kJ mol⁻¹). We used the fast smooth
particle-mesh Ewald method to model electrostatic interactions with a
real-space cut-off of 10 Å. Van der Waals interactions are truncated at 10 Å.
We integrate Newton's equations of motion with a time step of 1 fs by
using a leap-frog algorithm in the molecular-dynamics simulations. 
We use a Nosé-Hoover scheme to maintain the systems at a constant 
temperature. All molecular-dynamics simulations are carried out using 
Gromacs 4.5.5 software. First, we perform molecular-dynamics simula- 
tions of water with dissolved Na* and CI ions but without any charges 
on the substrates while the temperature is maintained at 300 K. The 
simulation lasts 3 ns. Next, we perform molecular-dynamics simula- 
tions in the constant-temperature and constant-volume ensemble at 
300 K for both switched-off and switched-on mode. The simulation 
lasts 5 ns for each mode. 
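
The Lorentz-Berthelot mixing rule and the 12-6 Lennard-Jones potential mentioned above can be sketched as follows. The ion parameters are those quoted in the text. The oxygen-site parameters for the water model (σ = 3.1668 Å, ε ≈ 0.8822 kJ mol⁻¹) are an assumption taken from the commonly quoted TIP4P/Ice parameterization and are not given in this section. The helper names are ours, and the snippet is an illustration, not part of the Gromacs workflow used by the authors.

```python
import math

# (sigma in angstrom, epsilon in kJ/mol); ion values as quoted in the text
ION_PARAMS = {"Na": (2.876, 0.5216), "Cl": (3.785, 0.5216)}
# ASSUMPTION: oxygen-site parameters commonly quoted for the TIP4P/Ice model
WATER_O = (3.1668, 0.8822)


def lorentz_berthelot(a, b):
    """Arithmetic mean of the sigmas, geometric mean of the epsilons."""
    sigma = 0.5 * (a[0] + b[0])
    epsilon = math.sqrt(a[1] * b[1])
    return sigma, epsilon


def lj_12_6(r, sigma, epsilon):
    """12-6 Lennard-Jones energy: 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


for ion, params in ION_PARAMS.items():
    sigma, epsilon = lorentz_berthelot(params, WATER_O)
    r_min = 2.0 ** (1.0 / 6.0) * sigma  # location of the potential minimum
    print(f"{ion}-O: sigma = {sigma:.4f} A, epsilon = {epsilon:.4f} kJ/mol, "
          f"U(r_min) = {lj_12_6(r_min, sigma, epsilon):.4f} kJ/mol")
```

The potential is zero at r = σ and reaches its minimum of −ε at r = 2^(1/6)σ, which matches the definitions of σ and ε given in the text.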


Data availability 


The data that support the findings of this study are available from the 
corresponding authors on reasonable request. 

Extended Data Fig. 1 | Surface morphology and sliding behaviour of water
droplets on the PTFE film and aluminium electrode. a, SEM image of the PTFE
film used in our DEG. Upon curing at 120 °C and solvent evaporation, the PTFE
precursor is transformed into a smooth and dense PTFE film. b, Photograph of
the fabricated PTFE film together with the ITO glass on a logo, showing the high
transparency of the film. c, A droplet of roughly 30.0 µl can easily slide on the
surface of a PTFE film made from pure PTFE solution (placed on a substrate
with a tilt angle of 15.0°). d, Contour graph image of an aluminium electrode
shows that its surface is very flat and uniform. e, A droplet can slide off an
aluminium electrode without leaving residual water. The aluminium electrode
is placed on a substrate with a tilt angle of 25.0°. There is no residual water on
the electrode surface.

Extended Data Fig. 2 | Surface charging by continuous droplet impinging,
and characterization of the output charge and power of the DEG. a, Variation
in the output voltage as a function of the number of individual impinging
droplets. In this case, the surface was not precharged. The output results
purely from charge generation and transfer during droplet impinging. b, The
output charge measured from the DEG (in red; the frequency of impinging
droplets is 4.2 Hz and the total number of droplets is about 42) is roughly
49.8 nC, which is around 640.1 times higher than that of the control device
(in black; the frequency of impinging droplets is 1.0 Hz, and the total number of
droplets is 9). c, When the load resistance increases from 1 kΩ to 100 MΩ, the
output current decreases from 250.0 µA to 2.0 µA. When the load resistance is
332.0 kΩ, the peak output power density is 50.1 W m⁻².

Extended Data Fig. 3 | Control experiment based on a triboelectric
nanogenerator. a, Optical image showing the as-fabricated control device.
The structure of this control device is similar to that of the DEG, but without an
aluminium electrode. b, Diagram showing its detailed working mechanism.
i, Before the droplet contacts the PTFE, the amount of (positive) charges on the
ITO is the same as the (negative) charges on the PTFE, owing to electrostatic
induction. Thus, there is no current flow from ITO to the ground. ii, When a
water droplet contacts the PTFE surface, the droplet becomes positively
charged while the PTFE becomes more negatively charged as a result of contact
electrification. iii, When the positively charged droplet leaves, it causes the ITO
electrode with positive charges to be unable to screen the more negatively
charged PTFE. iv, Accordingly, a flow of current (I) between the ground and ITO
electrode is induced by electrostatic induction. c, Variation in voltage output
from the control device as a result of continuous droplet impinging. The inset
shows the time-dependent variation in voltage from a single droplet. The
frequency of impinging droplets is set at 1.0 Hz, with a total of nine droplets.
d, In each test of the control device, the amount of transferred charge (in green)
is identical to the charge carried by the departing droplets (in blue), showing
that electricity generation from the control device indeed originates from
contact electrification. The frequency of impinging droplets is 1.0 Hz, with a
total of nine droplets.

Extended Data Fig. 4 | Enhanced electrical output using the ion-injection
method. a, Comparison of the output voltage generated from a single droplet
impinging on a DEG that was precharged by droplet impinging (in red) or by ion
injection (in blue). b, Comparison of the amount of electrical energy (E_out)
generated from a single droplet impinging on a DEG charged by droplet
impinging (in red) or ion injection (in blue). The instantaneous peak density
can be enhanced further by increasing the surface charge on the PTFE film
through ion injection, using a commercial antistatic gun (Zerostat3, Milty) to
inject various ions, including CO₃⁻, NO₃⁻, NO₂⁻, O₃⁻ and O₂⁻, from a vertical
distance of roughly 5.0 cm. c, Variation in the measured maximum charge, Q_max,
with droplet impinging on a PTFE surface that was precharged using ion
injection. Q_max decays rapidly and finally reaches a stable value of roughly
49.8 nC. The inset shows the Q_max for the first four droplets.

Extended Data Fig. 5 | Effect of surface charge on the maximum spreading
area, A_max, of a droplet. Data are means ± s.e.m. For each mean, the total
number of measurements is ten.

Extended Data Fig. 6 | Effect of the spatial location and width of the
aluminium electrode on electricity generation. a, The spatial location of the
aluminium electrode was changed, keeping the impact location fixed. In this
way, the spacing between the droplet centre and the electrode can be tailored.
Insets marked with 1, 2, 3, 4 refer to the four different locations of the
aluminium electrode on the PTFE surface. The results show that regardless of
the electrode location, the output voltage is constant, suggesting that
electricity generation is not sensitive to electrode location. Data are
means ± s.e.m. For each mean, the total number of measurements is ten. b, The
output voltage does not depend on the size of the aluminium electrode. This
makes sense because the source of electricity generation is the
electrostatically induced charge on the ITO, rather than on the aluminium. Data
are means ± s.e.m. For each mean, the total number of measurements is ten.

Extended Data Fig. 7 | Molecular-dynamics simulation. a, In the
molecular-dynamics simulation, negative (blue) and positive (yellow) charges are fixed on
atomic layers of PTFE (grey) and ITO (red), respectively. b, Molecular-dynamics
simulation showing the distribution of mobile charges (Na⁺ and Cl⁻) inside the
water and on the PTFE surface in switched-off mode (that is, without an
aluminium electrode, although the negative and positive charges on PTFE and
ITO are turned on). c, Molecular-dynamics simulation showing the distribution
of charges inside the water and on the PTFE surface in switched-on mode.
d, Comparison of the number of mobile charges transferred to the water/solid
interface in switched-on and switched-off modes.

Extended Data Fig. 8 | Control devices made of porous PTFE, PDMS and PP.
Comparison of the maximum stable surface charge, Q_max, on control devices
made of porous PTFE, PDMS and polypropylene (PP) after continuous droplet
impact under a relative humidity of 65.0%; all of these charges are much smaller
than that of our DEG surface.

Extended Data Fig. 9 | Thickness of PTFE film as a function of the volume of
PTFE precursor. The thickness of the PTFE film increases linearly from 6.7 µm
to 16.9 µm as the volume of PTFE precursor increases from 100.0 µl to 250.0 µl.

Simultaneously optimizing many design parameters in time-consuming experiments
causes bottlenecks in a broad range of scientific and engineering disciplines. One
such example is process and control optimization for lithium-ion batteries during
materials selection, cell manufacturing and operation. A typical objective is to
maximize battery lifetime; however, conducting even a single experiment to evaluate
lifetime can take months to years. Furthermore, both large parameter spaces and
high sampling variability necessitate a large number of experiments. Hence, the
key challenge is to reduce both the number and the duration of the experiments
required. Here we develop and demonstrate a machine learning methodology to
efficiently optimize a parameter space specifying the current and voltage profiles of
six-step, ten-minute fast-charging protocols for maximizing battery cycle life,
which can alleviate range anxiety for electric-vehicle users. We combine two key
elements to reduce the optimization cost: an early-prediction model, which reduces
the time per experiment by predicting the final cycle life using data from the first few
cycles, and a Bayesian optimization algorithm, which reduces the number of
experiments by balancing exploration and exploitation to efficiently probe the 
parameter space of charging protocols. Using this methodology, we rapidly identify 
high-cycle-life charging protocols among 224 candidates in 16 days (compared with 
over 500 days using exhaustive search without early prediction), and subsequently 
validate the accuracy and efficiency of our optimization approach. Our closed-loop 
methodology automatically incorporates feedback from past experiments to inform 
future decisions and can be generalized to other applications in battery design and, 
more broadly, other scientific domains that involve time-intensive experiments and 
multi-dimensional design spaces. 


Optimal experimental design (OED) approaches are widely used to
reduce the cost of experimental optimization. These approaches
often involve a closed-loop pipeline where feedback from completed
experiments informs subsequent experimental decisions, balancing
the competing demands of exploration (that is, testing regions of the
experimental parameter space with high uncertainty) and exploitation
(that is, testing promising regions based on the results of the completed
experiments). Adaptive OED algorithms have been successfully
applied to physical science domains, such as materials science,
chemistry, biology and drug discovery, as well as to computer
science domains, such as hyperparameter optimization for machine
learning. However, while a closed-loop approach is designed to
minimize the number of experiments required for optimizing across a
multi-dimensional parameter space, the time (and cost) per experiment
may remain high, as is the case for lithium-ion batteries. Therefore, an
OED approach should account for both the number of experiments
and the cost per experiment. Multi-fidelity optimization approaches
have been developed to learn from both inexpensive, noisy signals and
expensive, accurate signals. For example, in hyperparameter optimization
for machine learning algorithms, several low-fidelity signals for
predicting the final performance of an algorithmic configuration (for
example, extrapolated learning curves, rapid testing on a subset
of the full training dataset) are used in tandem with more complete
configuration evaluations. For lithium-ion batteries, classical
methods such as factorial design that use predetermined heuristics
to select experiments have been applied, but the design and use
of low-fidelity signals is challenging and unexplored. These previously
considered approaches do not discover and exploit the patterns present
in the parameter space for efficient optimization, nor do they address
the issue of time per experiment.

In this work, we develop a closed-loop optimization (CLO) system
with early outcome prediction for efficient optimization over large
parameter spaces with expensive experiments and high sampling
variability. We employ this system to experimentally optimize
fast-charging protocols for lithium-ion batteries; reducing charging times
to approach gasoline refuelling time is critical to reduce range anxiety
for electric vehicles but often comes at the expense of battery lifetime.
Specifically, we optimize over a parameter space consisting of
224 unique six-step, ten-minute fast-charging protocols (that is, how
current and voltage are controlled during charging) to find charging
protocols with high cycle life (defined as the number of cycles until the
battery capacity falls to 80% of its nominal value). Our system uses two key
elements to reduce the optimization cost (Extended Data Fig. 1). First, we reduce
the time per experiment by using machine learning to predict the
outcome of the experiment based on data from early cycles, well before
the batteries reach the end of life. Second, we reduce the number
of experiments by using a Bayesian optimization (BO) algorithm to
balance the exploration-exploitation tradeoff in choosing the next
round of experiments. Testing a single battery to failure under our
fast-charging conditions requires approximately 40 days, meaning
that when 48 experiments are performed in parallel, assessing all 224
charging protocols with triplicate measurements takes approximately
560 days. Here, using CLO with early outcome prediction, only 16 days
were required to confidently identify protocols with high cycle lives
(48 parallel experiments). In a subsequent validation study, we find that
CLO ranks these protocols by lifetime accurately (Kendall rank correlation
coefficient, 0.83) and efficiently (15 times less time than a baseline
'brute-force' approach that uses random search without early prediction).
Furthermore, we find that the charging protocols identified as
optimal by CLO with early prediction outperform existing fast-charging
protocols designed to avoid lithium plating (a common fast-charging
degradation mode), the approach suggested by conventional battery
wisdom. This work highlights the utility of combining CLO with
inexpensive early outcome predictors to accelerate scientific discovery.

Fig. 1 | Schematic of our CLO system. First, batteries are tested. The cycling
data from the first 100 cycles (specifically, electrochemical measurements
such as voltage and capacity) are used as input for an early outcome prediction
of cycle life. These cycle life predictions from a machine learning (ML) model
are subsequently sent to a BO algorithm, which recommends the next
protocols to test by balancing the competing demands of exploration (testing
protocols with high uncertainty in estimated cycle life) and exploitation
(testing protocols with high estimated cycle life). This process iterates until the
testing budget is exhausted. In this approach, early prediction reduces the
number of cycles required per tested battery, while optimal experimental
design reduces the number of experiments required. A small training dataset
of batteries cycled to failure is used both to train the early outcome predictor
and to set BO hyperparameters. In future work, design of battery materials and
processes could also be integrated into this closed-loop system.


CLO with early outcome prediction is depicted schematically in Fig. 1. 
The system consists of three components: parallel battery cycling, an 
early predictor for cycle life and a BO algorithm. At each sequential 
round, we iterate over these three components. The first component 
is a multi-channel battery cycler; the cycler used in this work tests 48 
batteries simultaneously. Before starting CLO, the charging proto- 
cols for the first round of 48 batteries are chosen at random (without 
replacement) from the complete set of 224 unique multi-step protocols
(Methods). Each battery undergoes repeated charging and discharging 
for 100 cycles (about 4 days; average predicted cycle life 905 cycles), 
beyond which the experiments are terminated. 
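
The time budgets quoted in the text follow from simple arithmetic, sketched below. The protocol count, replicate count, channel count and per-test durations are the figures given in the text. The number of closed-loop rounds (four) is our inference from 16 days at roughly 4 days per round and should be read as an assumption, as should the function names.

```python
import math


def exhaustive_days(n_protocols, replicates, channels, days_per_test):
    """Wall-clock days to test every protocol to failure, `channels` at a time."""
    rounds = math.ceil(n_protocols * replicates / channels)
    return rounds * days_per_test


def closed_loop_days(rounds, days_per_round):
    """Wall-clock days for sequential rounds of early-terminated tests."""
    return rounds * days_per_round


baseline = exhaustive_days(n_protocols=224, replicates=3, channels=48, days_per_test=40)
clo = closed_loop_days(rounds=4, days_per_round=4)  # ASSUMPTION: four rounds of ~4 days

print(f"exhaustive testing to failure: {baseline} days")
print(f"closed loop with early prediction: {clo} days")
```

The saving comes from two independent factors: each test stops after 100 cycles instead of running to failure, and far fewer than 672 tests are needed in total.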

These cycling data are then fed as input to the early outcome predic- 
tor, which estimates the final cycle lives of the batteries given data from 
the first 100 cycles. The early predictor is a linear model trained via 
elastic net regression on features extracted from the charging data of
the first 100 cycles (Supplementary Table 1), similar to that presented
in Severson et al. Predictive features include transformations of both
differences between voltage curves and discharge capacity fade trends. 
To train the early predictor, we require a training dataset of batteries 
cycled to failure. Here, we used a pre-existing dataset of 41 batteries 
cycled to failure (cross-validation root-mean-square error, 80.4 cycles; 
see Methods and Supplementary Discussion 1). Although obtaining
this dataset itself requires running full cycling experiments for a small
training set of batteries (the cost we are trying to offset), this one-time
cost can be avoided if pretrained predictors or previously collected
datasets are available. If they are unavailable, we pay an upfront cost to
collect this dataset, which could also be used for warm-starting the BO
algorithm. The size of the dataset should balance the upfront cost of
acquiring enough data to train an accurate model against the anticipated
reduction in experimentation requirements for CLO.

Finally, these predicted cycle lives from early-cycle data are fed into 
the BO algorithm (Methods and Supplementary Discussion 2), which 
recommends the next round of 48 charging protocols that best balance 
the exploration-exploitation tradeoff. This algorithm (Methods and 
Supplementary Discussion 2) builds on the prior work of Hoffman
et al. and Grover et al. The algorithm maintains an estimate of both
the average cycle life and the uncertainty bounds for each protocol; 
these estimates are initially equal for all protocols and are refined as 
additional data are collected. Crucially, to reduce the total optimiza- 
tion cost, our algorithm performs these updates using estimates from 
Fig. 2| Structure of our six-step, ten-minute fast-charging protocols.
Currents are defined as dimensionless C rates; here, 1C is 1.1 A, or the current
required to fully (dis)charge the nominal capacity (1.1 A h) in 1 h. a, Current
versus SOC for an example charging protocol, 7.0C-4.8C-5.2C-3.45C (bold 
lines). Each charging protocol is defined by five constant current (CC) steps 
followed by one constant voltage (CV) step. The last two steps (CC5 and CV1) 
are identical for all charging protocols. We optimize over the first four 
constant-current steps, denoted CC1, CC2, CC3 and CC4. Each of these steps 
comprises a 20% SOC window, such that CC1 ranges from 0% to 20% SOC, CC2 


the early outcome predictor instead of using the actual cycle lives. The 
mean and uncertainty estimates for the cycle lives are obtained via a
Gaussian process (Methods), which has a smoothing effect and allows
for updating the cycle life estimates of untested protocols with the 
predictions from related protocols. The closed-loop process repeats 
until the optimization budget, in our case 192 batteries tested (100 
cycles each), is exhausted. 

Our objective is to find the charging protocol which maximizes the 
expected battery cycle life for a fixed charging time (ten minutes) and 
state-of-charge (SOC) range (0 to 80%). The design space of our 224 six- 
step extreme fast-charging protocols is presented in Fig. 2a. Multi-step 
charging protocols, in which a series of different constant-current steps 
are applied within a single charge, are considered advantageous over 
single-step charging for maximizing cycle life during fast charging,
though the optimal combination remains unclear. As shown in Fig. 2b,
each protocol is specified by three independent parameters (CC1, CC2
and CC3); each parameter is a current applied over a fixed SOC range 
(0-20%, 20-40% and 40-60%, respectively). A fourth parameter, CC4, 
is dependent on CC1, CC2, CC3 and the charging time. Given constraints 
on the current values (Methods), a total of 224 charging protocols are
permitted. We test commercial lithium iron phosphate (LFP)/graphite 
cylindrical batteries (A123 Systems) in a convective environmental 
chamber (30 °C ambient temperature). A maximum voltage of 3.6 V
is imposed. These batteries are designed to fast-charge in 17 min
(rate testing data are presented in Extended Data Fig. 2). The cycle life
decreases dramatically with faster charging time, motivating this
optimization. Since the LFP positive electrode is generally considered
to be stable, we select this battery chemistry to isolate the effects of
extreme fast charging on graphite, which is universally employed in 
lithium-ion batteries. 

In all, we ran four CLO rounds sequentially, consisting of 185 bat- 
teries in total (excluding seven batteries; see Methods). Using early 
prediction, each CLO round requires four days to complete 100 cycles, 
resulting in a total testing time of sixteen days, a major reduction from
the 560 days required to test each charging protocol to failure three 
times. Figure 3 presents the predictions and selected protocols (Fig. 3a), 
as well as the evolution of cycle life estimates over the parameter space 




ranges from 20% to 40% SOC, and so on. CC4 is constrained by specifying that 
all protocols charge in the same total time (10 min) from 0% to 80% SOC. Thus, 
our parameter space consists of unique combinations of the three free 
parameters CC1, CC2 and CC3. For each step, we specify a range of acceptable
values; the upper limit is monotonically decreasing with increasing SOC to
avoid the upper cutoff potential (3.6 V for all steps). b, CC4 (colour scale) as a
function of CC1, CC2 and CC3 (on the x, y and z axes, respectively). Each point
represents a unique charging protocol. 


as the optimization progresses (Fig. 3b). Initially, the estimated cycle
lives for all protocols are equal. After two rounds, the overall structure 
of the parameter space (that is, the dependence of cycle life on charging
protocol parameters CC1, CC2 and CC3) emerges, and a prominent
region with high cycle life protocols has been identified. The confidence
of CLO in this high-performing region is further improved from round
2 to round 4, but overall the cycle life estimates do not change substantially
(Extended Data Fig. 3). By learning and exploiting the structure
of the parameter space, we avoid evaluating charging protocols with 
low estimated cycle life and concentrate more resources on the high- 
performing region (Extended Data Figs. 3-5). Specifically, 117 of 224 
protocols are never tested (Fig. 3c); we spend 67% of the batteries test- 
ing 21% of the protocols (0.83 batteries per protocol on average). CLO 
repeatedly tests several protocols with high estimated cycle life to 
decrease uncertainties due to manufacturing variability and the error 
introduced by early outcome prediction. The uncertainty is expressed 
as the prediction intervals of the posterior predictive distribution over 
cycle life (Extended Data Figs. 3g, 4, 5). 

To the best of our knowledge, this work presents the largest known 
map of cycle life as a function of charging conditions (Extended Data 
Fig. 5). This dataset can be used to validate physics-based models of 
battery degradation. Most fast-charging protocols proposed in the 
battery literature suggest that current steps decreasing monotonically 
as a function of SOC are optimal to avoid lithium plating on graphite, 
a well-accepted degradation mode during fast charging. In contrast,
the protocols identified as optimal by CLO (for example, Fig. 3d)
are generally similar to single-step constant-current charging (that
is, CC1 ≈ CC2 ≈ CC3 ≈ CC4). Specifically, of the 75 protocols with the
highest estimated cycle lives, only ten are monotonically decreasing
(that is, CC_i ≥ CC_{i+1} for all i) and two are strictly decreasing (that is,
CC_i > CC_{i+1} for all i). We speculate that minimizing parasitic reactions caused by heat
generation may be the operative optimization strategy for these cells, 
as opposed to minimizing the propensity for lithium plating (Supplementary
Discussion 3). While the optimal protocol for a new scenario
would depend on the selected charge time, SOC window, temperature
control conditions and battery chemistry, this unexpected result high- 
lights the need for data-driven approaches for optimizing fast charging. 
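
The two ordering criteria used in this comparison can be illustrated with a minimal Python sketch. The function names are illustrative assumptions and not part of the study's code; the two example protocols are the ones quoted in this paper (the Fig. 2 example and the protocol with the highest CLO-estimated mean cycle life).

```python
from typing import Sequence


def is_monotonically_decreasing(currents: Sequence[float]) -> bool:
    """True if no step is larger than the one before it (CC_i >= CC_{i+1})."""
    return all(a >= b for a, b in zip(currents, currents[1:]))


def is_strictly_decreasing(currents: Sequence[float]) -> bool:
    """True if every step is smaller than the one before it (CC_i > CC_{i+1})."""
    return all(a > b for a, b in zip(currents, currents[1:]))


# Protocols written as (CC1, CC2, CC3, CC4) in units of C.
example_fig2 = (7.0, 4.8, 5.2, 3.45)
top_protocol = (4.8, 5.2, 5.2, 4.160)

for protocol in (example_fig2, top_protocol):
    print(protocol,
          is_monotonically_decreasing(protocol),
          is_strictly_decreasing(protocol))
```

Neither quoted protocol satisfies either criterion, because in both cases one current step is followed by a larger one.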
Fig. 3| Results of closed-loop experiments. a, Early cycle life predictions per
round. The tested charging protocols and the resulting predictions are plotted
for rounds 1-4. Each point represents a charging protocol, defined by CC1, CC2
and CC3 (the x, y and z axes, respectively). The colour scale represents cycle life
predictions from the early outcome prediction model. The charging protocols
in the first round of testing are randomly selected. As the BO algorithm shifts
from exploration to exploitation, the charging protocols selected for testing 
by the closed loop in subsequent rounds fall primarily into the high-performing 
region. b, Evolution of the parameter space per round. The colour scale 
represents cycle life, as estimated by the BO algorithm. The initial cycle life 


We validate the performance of CLO with early prediction on a subset
of nine extreme fast-charging protocols. For each of these protocols, 
we cycle five batteries each to failure and use the sample average of the 
final cycle lives as an estimate of the true lifetimes. We use this valida- 
tion study to (1) confirm that CLO is able to correctly rank protocols 
based on cycle life, (2) compare the cycle lives of protocols recom- 
mended by CLO to protocols inspired by the battery literature and 
(3) compare the performance of CLO to baseline ablation approaches 
for experimental design. The charging protocols used in validation, 
estimates are equivalent for all protocols; as more predictions are generated,
the BO algorithm updates its cycle life estimates. The CLO-estimated mean
cycle lives after four rounds for all fast-charging protocols in the parameter
space are also presented in Extended Data Fig. 5 and Supplementary Table 3. 
c, Distribution of the number of repetitions for each charging protocol 
(excluding failed batteries). Only 46 of 224 protocols (21%) are tested multiple 
times. d, Current versus SOC for the top three fast-charging protocols, as 
estimated by CLO. CC1-CC4 are displayed in the legend. All three protocols 
have relatively uniform charging (that is, CC1 ≈ CC2 ≈ CC3 ≈ CC4).


some of which are inspired by existing battery fast-charging literature
(see Methods), span the range of estimated cycle lives (Extended Data 
Fig. 6 and Extended Data Table 1). We adjust the voltage limits and 
charging times of these literature protocols to match our protocols, 
while maintaining similar current ratios as a function of SOC. Whereas 
the literature protocols used in these validation experiments are gener- 
ally designed for batteries with high-voltage positive electrode chem- 
istries, fast-charging optimization strategies generally focus on the 
graphitic negative electrode. For these nine protocols, we validate
Fig. 4| Results of validation experiment. a, Discharge capacity versus cycle
number for all batteries in the validation experiment. The nine validation
protocols include the top three protocols as estimated by CLO (‘CLO top 3’),
four protocols inspired by the battery literature (‘Literature-inspired’) and
two protocols selected to obtain a representative sampling from the
distribution of CLO-estimated cycle lives among the validation protocols
(‘Other’). b, Comparison of early-predicted cycle lives from validation to
closed-loop estimates, averaged on a protocol basis. Each ten-minute charging
protocol is tested with five batteries. Error bars represent the 95% confidence
intervals. c, Observed versus early-predicted cycle life for the validation 
experiment. Although our early predictor tends to overestimate cycle life, 


the ‘CLO-estimated’ cycle lives against the sample average of the five 
final cycle lives. 

The validation results are presented in Fig. 4. The discharge capacity 
fade curves (Fig. 4a) exhibit the nonlinear decay typical of fast charging.
If we apply our early-prediction model to the batteries in the
validation experiment, these early predictions (averaged over each 
protocol) match the CLO-estimated mean cycle lives well (Pearson 
correlation coefficient r = 0.93; Fig. 4b). This result validates the performance
of the BO component of CLO in particular, since the CLO-
estimated cycle lives were inferred from early predictions. However,
our early-prediction model exhibits some bias (Fig. 4c), probably owing
to calendar ageing effects from different battery storage times (Supplementary
Table 2 and Supplementary Discussion 4). Despite this bias
in our predictive model, we generally capture the ranking well (Kendall 
rank correlation coefficient, 0.83; Fig. 4d and Extended Data Fig. 7). 
At the same time, we note that the final cycle lives for the top-ranked 
protocols are similar. Furthermore, the optimal protocols identified 
by CLO outperform protocols inspired by previously published fast- 
charging protocols (895 versus 728 cycles on average; Extended Data 
Fig. 6 and Extended Data Table 1). This result suggests that the efficiency 
of our approach does not come at the expense of accuracy. 
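
For readers unfamiliar with the Kendall rank correlation coefficient, the following Python sketch shows how it is computed from concordant and discordant pairs. The rank lists are hypothetical and are not the measured validation data. With nine protocols there are 36 pairs, so a coefficient of 0.83 would correspond to three discordant pairs if there are no ties; that correspondence is our inference, not a statement from the source.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall rank correlation (tau-a): (concordant - discordant) / total pairs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical ranks for nine protocols (illustration only).
clo_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]
validation_rank = [2, 1, 3, 5, 4, 6, 7, 9, 8]  # three neighbouring pairs swapped

print(round(kendall_tau(clo_rank, validation_rank), 2))  # 0.83
```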


probably owing to calendar ageing effects (Supplementary Discussion 4), the 
trend is correctly captured (Pearson correlation coefficient r = 0.86). d, Final
cycle lives from validation, sorted by CLO ranking. The length of each bar and
the annotations represent the mean final cycle life from validation per
protocol. Error bars represent the 95% confidence intervals. e, Ablation study
of various optimization approaches using the protocols and data in the
validation set (Methods). Error bars represent the 95% confidence intervals
(n = 2,000). With contributions from both early prediction and Bayesian
optimization, CLO can rapidly identify high-performing charging protocols. 
The gains from Bayesian optimization are larger when resources are 
constrained (Extended Data Fig. 8). 


Our method greatly reduces the optimization time required compared 
to baseline optimization approaches (Fig. 4e). For instance, a procedure
that does not use early outcome prediction and simply selects protocols 
randomly to test begins to saturate at a competitive performance level 
after about 7,700 battery-hours of testing. To achieve a similar level of 
performance, CLO with both early outcome prediction and the BO algo- 
rithm requires only 500 battery-hours of testing. For this small-scale vali- 
dation experiment, we observe that the early-prediction component of 
CLO greatly reduces the time per experiment. Here, random selection is 
equivalent to a pure exploration strategy and can achieve a performance
similar to the BO-based approaches for smaller experimental budgets. In 
later stages, random selection is eventually outperformed by BO-based 
approaches, which exploit the structure across the protocols and focus 
on reducing the uncertainty in the promising regions of the parameter 
space. Although these results are specific to this validation study, we 
observe similar or larger gains in simulations when fewer batteries or 
fewer parallel experiments (relative to the size of the parameter space) 
are available (Extended Data Fig. 8). The relative gains from BO over 
random selection are largest with minimal resources. 

Finally, we compare our early predictor with other low-fidelity predic- 
tors proposed in state-of-the-art multi-fidelity optimization algorithms 
in the literature, and find that our approach outperforms these
algorithms (Supplementary Discussion 2 and Supplementary Table 4). The
generic early-prediction models in these previous works fit composites 
of parametric functions to the capacity fade curves, while our model 
uses additional features recorded at every cycle (for example, voltage). 
This result highlights the value of designing predictive models for the 
target application in multi-fidelity optimization. 

In summary, we have successfully accelerated the optimization of 
extreme fast charging for lithium-ion batteries using CLO with early 
outcome prediction. This method could extend to other fast-charging 
design spaces, such as pulsed and constant-power charging, as well
as to other objectives, such as slower charging and calendar ageing.
Additionally, this work opens up new applications for battery optimization,
such as formation, adaptive cycling and parameter estimation
for battery management system models. Furthermore, provided that
a suitable early outcome predictor exists, this method could also be 
applied to optimize other aspects of battery development, such as 
electrode materials and electrolyte chemistries. Beyond batteries, our 
CLO approach combining black-box optimization with early outcome 
prediction can be extended to efficiently optimize other physical
and computational multi-dimensional parameter spaces that
involve time-intensive experimentation, illustrating the power of 
data-driven methods to accelerate the pace of scientific discovery. 
Methods


Experimental 

Commercial high-power lithium iron phosphate (LFP)/graphite A123 
APR18650M1A cylindrical cells were used in this work (packing date
2015-09-26, lot number EL1508007-R). These cells have a nominal
capacity of 1.1 A h and a nominal voltage of 3.3 V. All currents are defined
in units of C rate; here, 1C is 1.1 A, or the current required to fully (dis)
charge the nominal capacity (1.1 A h) in 1 h. The manufacturer’s recommended
fast-charging protocol is 3.6C (3.96 A) CC-CV. The rate
capability of these cells is shown in Extended Data Fig. 2. The graphite
and LFP electrodes are 40 μm thick and 80 μm thick, respectively, as
quantified via X-ray tomography (Zeiss Xradia 520 Versa). 

The cells were cycled with various charging protocols but identically 
discharged. Cells were charged with one of 224 candidate six-step, ten- 
minute charging protocols from 0% to 80% SOC, as detailed below. After 
a five-second rest, all cells then charged from 80% to 100% SOC with 
a 1C CC-CV charging step to 3.6 V and a current cutoff of C/20. After
another five-second rest, all cells subsequently discharged with a CC-CV
discharge at 4C to 2.0 V and a current cutoff of C/20. The cells rested
for another five seconds before the subsequent charging step started. 
The lower and upper cutoff voltages were 2.0 V and 3.6 V, respectively, 
as recommended by the manufacturer. In this work, cycle life is defined 
as the number of cycles until the discharge capacity falls below 80% of 
the nominal capacity. 
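
The cycle-life definition can be sketched as a small Python helper. The function name, the 1-based cycle counting and the shortened capacity trace are illustrative assumptions; only the 80% threshold and the 1.1 A h nominal capacity come from the text.

```python
from typing import Optional, Sequence

NOMINAL_CAPACITY_AH = 1.1    # nominal capacity stated in the text
END_OF_LIFE_FRACTION = 0.80  # 80% threshold stated in the text


def cycle_life(discharge_capacities_ah: Sequence[float]) -> Optional[int]:
    """Return the 1-based number of the first cycle below the threshold.

    Returns None if the capacity never falls below the threshold, that is,
    if the cell has not yet reached end of life in the recorded data.
    """
    threshold = END_OF_LIFE_FRACTION * NOMINAL_CAPACITY_AH  # about 0.88 A h
    for cycle_number, capacity in enumerate(discharge_capacities_ah, start=1):
        if capacity < threshold:
            return cycle_number
    return None


# Hypothetical, heavily shortened capacity trace (A h per cycle).
trace = [1.05, 1.00, 0.93, 0.87, 0.85]
print(cycle_life(trace))  # 4
```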

All cells were tested in cylindrical fixtures with 4-point contacts on a
48-channel Arbin Laboratory Battery Testing battery cycler placed in
an environmental chamber (Amerex Instruments) at 30 °C. The cycler
calibration was validated before the start of the experiment.

In the closed-loop experiment, four experiments did not reach 100
cycles owing to contact issues either at the start or partially through
the experiment. These experiments were run on channels 17 and 27 in
round 1 (oed_0) and channels 4 and 5 in round 2 (oed_1). Additionally,
in each round, one protocol that should have been selected
(that is, with a top-48 upper bound) was not selected and was replaced with
the protocol with the 49th-highest upper bound owing to a processing
error (Extended Data Fig. 4), but this error is not expected to have
notes of the data repository. 


Charging protocol and parameter space design 

Cells were charged with one of 224 different four-step charging proto- 
cols. Each of the first four steps is a single constant-current step applied 
over a 20% SOC range; thus, the 224 charging protocols represent different
combinations of current steps within the 0% to 80% SOC range.
We can define the charging time from 0% to 80% SOC by: 


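$$t_{0-80\%} = \frac{0.2}{\mathrm{CC1}} + \frac{0.2}{\mathrm{CC2}} + \frac{0.2}{\mathrm{CC3}} + \frac{0.2}{\mathrm{CC4}}$$


In all protocols considered here, we constrain $t_{0-80\%}$ to be 10 min. We
now write CC4 as a function of the first three charging steps, as:


$$\mathrm{CC4} = \frac{0.2}{\frac{10}{60} - \left(\frac{0.2}{\mathrm{CC1}} + \frac{0.2}{\mathrm{CC2}} + \frac{0.2}{\mathrm{CC3}}\right)}$$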


Thus, each protocol can be uniquely defined by CC1, CC2 and CC3. 

Each independent parameter can take on one of the following dis- 
crete values: 3.6C, 4.0C, 4.4C, 4.8C, 5.2C and 5.6C. Furthermore, CC1 
can take on values of 6.0C, 7.0C and 8.0C, and CC2 can take on values 
of 6.0C and 7.0C. CC4 is not allowed to exceed 4.81C. The maximum 
allowable current for each parameter decreases with increasing SOC to 
avoid reaching the upper cutoff voltage of 3.6 V. With these constraints, 
a total of 224 charging protocols are permitted. 
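
As a worked check of the charging-time constraint, the following Python sketch computes CC4 from CC1, CC2 and CC3 for a ten-minute charge over four 20% SOC steps. The function and constant names are illustrative assumptions; the three protocols are ones quoted elsewhere in this paper, and C rates are treated as having units of 1/h so that times come out in hours.

```python
def cc4_from_first_three(cc1: float, cc2: float, cc3: float,
                         charge_time_min: float = 10.0,
                         soc_step: float = 0.2) -> float:
    """Return the CC4 rate (in C) that completes 0-80% SOC in the given time."""
    elapsed_h = soc_step * (1.0 / cc1 + 1.0 / cc2 + 1.0 / cc3)
    remaining_h = charge_time_min / 60.0 - elapsed_h
    if remaining_h <= 0:
        raise ValueError("first three steps already use up the charging time")
    return soc_step / remaining_h


MAX_CC4 = 4.81  # upper limit on CC4 stated in the text

for cc1, cc2, cc3 in [(7.0, 4.8, 5.2), (4.8, 5.2, 5.2), (3.6, 6.0, 5.6)]:
    cc4 = cc4_from_first_three(cc1, cc2, cc3)
    status = "allowed" if cc4 <= MAX_CC4 else "rejected"
    print(f"{cc1}C-{cc2}C-{cc3}C-{cc4:.3f}C {status}")
```

The computed CC4 values agree with the protocol names quoted in the text (3.45C, 4.160C and 4.755C) to the stated precision.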


For a consistent protocol nomenclature, we define each fast-charging
protocol as CC1-CC2-CC3-CC4. For example, the charging protocol 
with the highest CLO-estimated mean cycle life is written 4.8C-5.2C- 
5.2C-4.160C. 


Early outcome predictor 

The early outcome predictor for cycle life is similar to that presented 
in Severson et al. This linear model predicts the final $\log_{10}$ cycle life
(number of cycles to reach 80% of nominal capacity, or 0.88 A h) using
features from the first 100 cycles. The training set is identical to the one
used in Severson et al. and consists of 41 batteries. The linear model
takes the form: 


$$\hat{y}_i = \hat{w}^{T} x_i$$


Here $\hat{y}_i$ is the predicted cycle life for battery $i$, $x_i$ is a p-dimensional
feature vector for battery $i$ and $\hat{w}$ is a p-dimensional model coefficient
vector. Features are z-scored (mean-subtracted and normalized by the 
standard deviation) to the training set before model evaluation. 
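
A minimal Python sketch of this prediction step is given below. All numerical values are hypothetical, and the explicit intercept term is our assumption (the source does not say how the mean of the response is handled); the sketch only illustrates z-scoring against training-set statistics followed by a linear map to $\log_{10}$ cycle life.

```python
from typing import List, Sequence


def z_score(features: Sequence[float],
            train_means: Sequence[float],
            train_stds: Sequence[float]) -> List[float]:
    """Standardize features using training-set means and standard deviations."""
    return [(x - m) / s for x, m, s in zip(features, train_means, train_stds)]


def predict_cycle_life(features: Sequence[float],
                       weights: Sequence[float],
                       intercept: float,
                       train_means: Sequence[float],
                       train_stds: Sequence[float]) -> float:
    """Predict cycle life from a linear model fitted to log10(cycle life)."""
    z = z_score(features, train_means, train_stds)
    log10_life = intercept + sum(w * zi for w, zi in zip(weights, z))
    return 10.0 ** log10_life


# Hypothetical two-feature model (values for illustration only).
means, stds = [-2.0, 0.1], [0.5, 0.02]
weights, intercept = [-0.2, 0.05], 2.9

print(predict_cycle_life([-2.5, 0.1], weights, intercept, means, stds))
```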

Regularization, or simultaneous feature selection and model fitting, 
was performed using the elastic net. Regularization penalizes overly
complex fits to improve both generalizability and interpretability. 
Specifically, the coefficient vector $\hat{w}$ is found via the following expression:


$$\hat{w} = \underset{w}{\operatorname{argmin}} \; \lVert y - Xw \rVert_2^2 + \lambda \left( \frac{1-\alpha}{2} \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1 \right)$$


Here $\lambda$ and $\alpha$ are hyperparameters; $\lambda$ is a non-negative scalar and $\alpha$ is a
scalar between 0 and 1. The first term minimizes the squared loss, and
the second term performs both continuous shrinkage and automatic
feature selection. During model development, we apply fourfold cross-
validation and Monte Carlo sampling with the training set to optimize
the values of the hyperparameters $\lambda$ and $\alpha$.
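
To make the roles of the two hyperparameters concrete, the sketch below evaluates the elastic net objective (squared loss plus a weighted mix of the squared L2 norm and the L1 norm of the coefficients) for a given coefficient vector. It is an illustration with made-up numbers, not the authors' fitting code, and it does not perform the minimization itself.

```python
from typing import Sequence


def elastic_net_objective(X: Sequence[Sequence[float]],
                          y: Sequence[float],
                          w: Sequence[float],
                          lam: float,
                          alpha: float) -> float:
    """Squared loss + lam * ((1 - alpha) / 2 * ||w||_2^2 + alpha * ||w||_1)."""
    if lam < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("lam must be >= 0 and alpha must lie in [0, 1]")
    squared_loss = sum(
        (y_i - sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))) ** 2
        for x_i, y_i in zip(X, y)
    )
    ridge_term = (1.0 - alpha) / 2.0 * sum(w_j ** 2 for w_j in w)
    lasso_term = alpha * sum(abs(w_j) for w_j in w)
    return squared_loss + lam * (ridge_term + lasso_term)


# Made-up data: two observations, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
print(elastic_net_objective(X, y, w=[1.0, 1.0], lam=2.0, alpha=0.5))  # 4.0
```

Setting alpha to 1 leaves only the L1 penalty (lasso) and setting it to 0 leaves only the squared L2 penalty (ridge). Library implementations may scale these terms differently; scikit-learn's `ElasticNet`, for example, divides the squared loss by twice the number of samples and names the two hyperparameters `alpha` and `l1_ratio`.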

As in Severson et al., the available features were based on the difference
between discharge voltage curves of cycles 100 and 10, or trends
in the discharge capacity. The five selected features, their corresponding
weights and the z-scored values are presented in Supplementary
Table 1. The training (cross-validated) error was 80.4 cycles (10.2%);
the test error on a test set from Severson et al. was 122 cycles (12.6%).

The early predictor automatically flags predictions as anomalous if 
the 95% prediction interval exceeds 2,000 cycles. The two-tailed 95% 
prediction interval is computed by: 


$$95\%\,\mathrm{PI} = 2\, t_{\alpha/2,\, n-p} \times \mathrm{RMSE} \times \sqrt{1 + x_i^{T} (X^{T} X)^{-1} x_i}$$


where $t$ is the Student’s t value, $\alpha$ is the significance level (0.05 for
95% confidence), $n$ is the number of samples, $p$ is the number of features,
RMSE is the root-mean-square error of the training set (in units
of cycles), $x_i$ is the vector of selected features for battery $i$ and $X$ is
the matrix of selected features for all observations in the training
set.
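
The prediction-interval check can be sketched in plain Python as follows. The helper names and the toy feature matrix are assumptions for illustration; the Student's t value is passed in as an argument (it could be looked up in a table or computed with a statistics library), and the 2,000-cycle threshold is the one stated in the text.

```python
import math
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]],
                        b: Sequence[float]) -> List[float]:
    """Solve a @ z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def prediction_interval_width(x_new: Sequence[float],
                              x_train: Sequence[Sequence[float]],
                              rmse: float,
                              t_value: float) -> float:
    """Two-tailed width: 2 * t * RMSE * sqrt(1 + x^T (X^T X)^-1 x)."""
    p = len(x_new)
    xtx = [[sum(row[j] * row[k] for row in x_train) for k in range(p)]
           for j in range(p)]
    z = solve_linear_system(xtx, x_new)           # z = (X^T X)^-1 x
    leverage = sum(x_j * z_j for x_j, z_j in zip(x_new, z))
    return 2.0 * t_value * rmse * math.sqrt(1.0 + leverage)


def is_anomalous(width_cycles: float, threshold_cycles: float = 2000.0) -> bool:
    """Flag a prediction whose interval exceeds the threshold from the text."""
    return width_cycles > threshold_cycles


# Toy training features (hypothetical) and a hypothetical t value.
x_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
width = prediction_interval_width([1.0, 1.0], x_train, rmse=80.4, t_value=2.0)
print(round(width, 1), is_anomalous(width))
```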

In the closed-loop experiment, three tests returned predictions 
with a prediction interval outside of the threshold; these anoma- 
lous predictions were excluded. These tests were run on channel 
27 in round 1 (oed_0), channel 12 in round 3 (oed_2) and channel 6 in 
round 4 (oed_3). Furthermore, in the validation experiment, one test 
returned a prediction with a prediction interval outside of the thresh- 
old (channel 12; 3.6C-6.0C-5.6C-4.755C), although the final cycle life 
was reasonable. 

We note that the predictions from this model exhibited systematic 
bias for the cells in the validation experiments, which we attribute to 
the increased calendar ageing of these cells relative to the training set 
(Supplementary Table 2 and Supplementary Discussion 4). 
Bayesian optimization algorithm

To perform optimal experimental design, we consider the setting of 
best-arm identification using multi-armed bandits. Here each arm is 
a charging protocol and the goal is to identify the best arm, or equivalently
the charging protocol with the highest expected cycle life. Many
variants of the problem have been studied in prior work; our algorithm
builds on the approaches of Hoffman et al. and Grover et al.
We consider further modifications in Supplementary Discussion 2. 

In particular, we assume a Bayesian regression setting, where there 
exists an unknown set of parameters ($\theta \in \mathbb{R}^d$) that relate a charging
protocol $x$ to its cycle life (a scalar) via a Gaussian likelihood function.
Here, $x$ denotes the CC1, CC2, CC3 configurations of a charging protocol,
which is projected onto a d-dimensional feature vector $\phi(x)$. We
set $d = 224$, and the feature representations $\phi(x)$ are obtained by
approximating a radial-basis function kernel, $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert_2^2)$,
using Nystroem’s method. Here, $x_i$ and $x_j$ are the CC1, CC2 and CC3
configurations for two arbitrary charging protocols and the inverse of
the kernel bandwidth, $\gamma > 0$, is treated as a hyperparameter.
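
As an illustration of how such a kernel couples neighbouring protocols, the sketch below evaluates a radial-basis function kernel exactly for pairs of (CC1, CC2, CC3) configurations. It assumes the usual negative sign in the exponent, uses a hypothetical value of the inverse-bandwidth hyperparameter, and does not perform the Nystroem approximation itself.

```python
import math
from typing import Sequence


def rbf_kernel(x_i: Sequence[float], x_j: Sequence[float], gamma: float) -> float:
    """Radial-basis function kernel exp(-gamma * ||x_i - x_j||^2)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# (CC1, CC2, CC3) configurations in units of C; gamma is hypothetical.
reference = (4.8, 5.2, 5.2)
neighbour = (4.8, 5.2, 5.6)
distant = (8.0, 3.6, 3.6)
gamma = 1.0

print(rbf_kernel(reference, neighbour, gamma))  # close to 1: strongly related
print(rbf_kernel(reference, distant, gamma))    # close to 0: weakly related
```

A larger gamma makes the similarity fall off faster with distance, so information from a tested protocol is shared with fewer of its neighbours.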

The Gaussian likelihood function relates a charging protocol to its 
cycle life distribution. For a protocol $x$, the mean of this likelihood
function is given as $\theta^{T} \phi(x)$. The variance of this likelihood function
is the sum of two uncertainty terms, both of which we assume to be 
homoskedastic (that is, uniform across all protocols). The first term is 
the empirical variance averaged across the repeated runs of individual 
protocols present in the training dataset (same as that used for training 
the early predictor). This accounts for variability due to exogenous 
factors such as manufacturing. Second, since we do not wait for an 
experiment to complete, the likelihood variance additionally needs 
to accommodate an additional uncertainty term due to the early out- 
come prediction component of the pipeline. We do so by computing 
the residual variance of the early predictions on the held-out portion 
of the dataset and set the aforementioned uncertainty term to be the 
maximum of the residual variances. We assume that the two sources 
of uncertainty are independent, and hence the overall variance of the 
likelihood distribution is given by the sum of the squares of both vari- 
ance terms described above. 

To perform inference over the unknown parameters $\theta$ and subsequent
predictions of cycle lives, we employ a Gaussian process. In a
Gaussian process, the prior over $\theta$ is assumed to be isotropic Gaussian;
such a prior is conjugate to the Gaussian likelihood, and as a consequence
the Gaussian posterior can be obtained in closed form via
Bayes’ rule. This posterior is used to define a Gaussian predictive
distribution over the cycle life for any given charging protocol with
mean $\mu$ and variance $\sigma^2$.

Finally, to select a charging protocol, we optimize an acquisition 
function based on upper confidence bounds. The acquisition function 
selects protocols where the noisy predictive distribution over cycle 
life has high mean $\mu$ (to encourage exploitation) and high variance $\sigma^2$
(to encourage exploration). The mean and the upper and lower confidence
bounds for any arm $i$ at round $k$ are given by $\mu_{k,i} \pm \beta_k \sigma_{k,i}$, such that the relative
weighting of the two terms is controlled by the exploration tradeoff
hyperparameter, $\beta > 0$. The exploration tradeoff hyperparameter at
round $k$, $\beta_k$, is decayed multiplicatively at every round of the closed
loop by another hyperparameter, $\epsilon \in (0, 1]$, as given by $\beta_k = \beta_0 \epsilon^k$.
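
The selection step can be sketched in Python as a ranking by upper confidence bound with a multiplicatively decaying exploration weight. The protocol labels, posterior means and standard deviations below are hypothetical, and the sketch omits the Gaussian process update that would supply those estimates in the actual algorithm.

```python
from typing import Dict, List


def exploration_weight(beta_0: float, epsilon: float, round_index: int) -> float:
    """Exploration weight at a 0-indexed round: beta_0 * epsilon ** round_index."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    return beta_0 * epsilon ** round_index


def select_batch(means: Dict[str, float],
                 stds: Dict[str, float],
                 beta: float,
                 batch_size: int) -> List[str]:
    """Return the protocols with the highest upper bounds, mean + beta * std."""
    upper_bounds = {name: means[name] + beta * stds[name] for name in means}
    ranked = sorted(upper_bounds, key=upper_bounds.get, reverse=True)
    return ranked[:batch_size]


# Hypothetical posterior estimates (cycles) for three protocols.
means = {"A": 900.0, "B": 800.0, "C": 850.0}
stds = {"A": 10.0, "B": 200.0, "C": 50.0}

for k in range(4):
    beta_k = exploration_weight(beta_0=1.0, epsilon=0.5, round_index=k)
    print(k, beta_k, select_batch(means, stds, beta_k, batch_size=2))
```

With a large weight the uncertain protocol B is selected first (exploration); as the weight decays, the ranking is dominated by the estimated mean (exploitation). In the experiment described here the batch size would be 48.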


BO hyperparameter optimization 

The BO algorithm relies on eight hyperparameters, each of which 
can be categorized as either a resource hyperparameter, a parameter 
space hyperparameter or an algorithm hyperparameter. We note that 
the BO algorithm runs in the fixed-budget setting; here, the budget 
refers to the number of iterations of the closed loop we run, exclud- 
ing validation experiments. We describe each category of hyperpa- 
rameters below; the values of each hyperparameter are tabulated in 
Supplementary Table 5. 


Resource hyperparameters are specified by the available testing 
resources. The ‘batch size’ represents the number of parallel tests. We 
set a batch size of 48 given our 48-channel battery cycler. The ‘budget’
represents the number of batches tested during CLO. The budget 
excludes batches used to develop the early predictor and validation 
batches. The budget is typically constrained by either the available 
testing time or the number of cells. In this case, we set a budget of 4, 
yielding a cell budget of 192 cells and a time budget of 16 days (4 days 
per batch of 48 cells tested for 100 cycles). 

Parameter space hyperparameters are specified by the optimization 
problem. Here, we use the same data available from the training set of 
the early predictor to estimate these parameters, despite a different 
charging protocol structure. The ‘standardization mean’ represents 
the estimated mean cycle life across all protocols. The ‘standardiza- 
tion standard deviation’ represents the estimated standard deviation 
of cycle life across all protocols; in other words, this parameter repre- 
sents the range of cycle lives in the parameter space. The ‘likelihood 
standard deviation’ represents the estimated standard deviation of a 
single protocol tested multiple times, which is a measure of the sam- 
pling error; this sampling error includes both the intrinsic variability 
and the prediction error. 

Algorithm hyperparameters control the performance of the Bayesian 
optimization algorithm. $\gamma$ is the inverse of the kernel bandwidth, which controls the
interaction strength between neighbouring protocols in the parameter
space. High $\gamma$ favours under-smoothing of the parameter space, that is,
the protocols have weak relationships with their neighbours. $\beta_0$ represents
the initial value of $\beta$, the exploration tradeoff hyperparameter; $\beta$
controls the balance of exploration versus exploitation. High $\beta_0$ favours
exploration over exploitation. $\epsilon$ represents the decay constant of $\beta$
per round; as the experiment progresses, $\beta$ decays and the algorithm shifts
towards stronger exploitation (given by $\beta_k = \beta_0 \epsilon^k$, where $\beta_k$ represents the exploration
constant at round $k$, 0-indexed). Because $\epsilon \in (0, 1]$, a lower $\epsilon$ favours a more
rapid transition from exploration to exploitation.

The algorithm hyperparameters were estimated by creating a phys- 
ics-based simulator based on the range of cycle lives obtained in the 
preliminary batch, testing all hyperparameter combinations on the 
simulator, and selecting the hyperparameter combination with the 
best performance (that is, that which most consistently obtains the 
true cycle life). These results are visualized in Extended Data Fig. 9; we 
note that the performance of BO is relatively insensitive to the selected 
combination of algorithm hyperparameters, meaning sufficiently 
high performance can be achieved even with suboptimal algorithm 
hyperparameters. Other approaches, such as using the early-predictor 
training dataset, are also possible for optimization of the algorithm 
hyperparameters (Supplementary Discussion 1). 


Physics-based simulator 

We used a physics-based simulator for hyperparameter optimization; 
this simulator allows us to estimate the shape and range of cycle lives 
in the parameter space, although the simulator is not designed to be 
an accurate representation of battery degradation during fast charg- 
ing. This finite element simulator was originally designed to estimate 
the heat generation during charging in an 18650 cylindrical battery by 
approximating the battery as a long cylinder, which simplifies to a one-dimensional
radial heat transfer problem. The equations and thermal
properties were sourced from Drake et al.*° and Cengel and Boles”. The 
output from these simulations is a matrix of temperature as a func- 
tion of both radial position and time. We use total solid-electrolyte 
interphase (SEI) growth as a proxy for degradation. First, we estimate
the temperature dependence of SEI growth from the C/10 series of 
figure 7 from Smith et al.** (Supplementary Table 6). Simultaneously, 
we compute the expected temperature profiles in the battery as a func- 
tion of charging protocol with respect to time and position. We then 
approximate the kinetics of SEI growth with an Arrhenius equation, 
such that SEI growth increases with increasing temperature. SEI growth
(in arbitrary units) is calculated for each temperature element in the
position-time array via:

$$D = \sum_{t} \sum_{r} \exp\left(-\frac{E_a}{k_B T}\right)$$

where D is the degradation parameter, T is the temperature of each
element (summed over time t and radial position r), E_a is the effective
activation energy for SEI growth (Supplementary Table 6) and k_B is Boltzmann’s
constant. The cycle life is then calculated from the degradation param- 
eter using the range of expected cycle lives (as estimated from the 
early-predictor training dataset): 


Cycle life = 500+C/D 


where C is a constant (5 x 10) that scales D to reasonable values of
cycle life.
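
As a rough illustration of how the degradation parameter and the cycle life relation fit together, the sketch below sums an Arrhenius term over a small position-time temperature array. The activation energy, the scaling constant and the temperature values are hypothetical placeholders chosen only to make the example run; they are not the values from Supplementary Table 6, and the function names are assumptions.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def degradation(temps_kelvin, e_a_ev):
    """Sum exp(-E_a / (k_B * T)) over every element of a position-time array."""
    return sum(math.exp(-e_a_ev / (K_B * t)) for row in temps_kelvin for t in row)


def cycle_life(d, c, offset=500.0):
    """Cycle life = offset + C / D, with offset = 500 as in the text."""
    return offset + c / d


# Hypothetical temperature arrays (rows: radial positions, columns: time steps).
cool = [[298.0, 300.0], [299.0, 301.0]]
hot = [[318.0, 320.0], [319.0, 321.0]]

E_A = 0.5       # eV, hypothetical activation energy
C_SCALE = 1e-5  # hypothetical scaling constant

for name, temps in (("cool", cool), ("hot", hot)):
    d = degradation(temps, E_A)
    print(name, d, cycle_life(d, C_SCALE))
```

Under these assumptions the hotter profile yields a larger D and therefore a shorter simulated cycle life, which is the qualitative behaviour the simulator is meant to capture.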


Validation experiments 

After the closed-loop experiment completed, we selected nine pro- 
tocols to test to failure (five batteries per charging protocol). This 
experiment allowed us to (1) evaluate the performance of the closed 
loop by comparing the CLO-estimated mean cycle lives to the mean 
cycle life of multiple batteries tested to failure for multiple protocols, 
(2) compare the protocols with the highest CLO-estimated mean cycle 
lives to conventional fast-charging protocol design principles from 
the battery literature, and (3) generate a small dataset with which we 
can evaluate the performance of the closed loop relative to baseline 
optimization approaches. 

The selected protocols are displayed in Extended Data Fig. 6 and 
Extended Data Table 1. Of our nine fast-charging protocols, three were 
the top three CLO-estimated protocols; four were based on approxima- 
tions of multi-step fast-charging protocols in the battery literature (see 
Extended Data Table 1); and two were selected to obtain a representa- 
tive sampling from the distribution of CLO-estimated cycle lives. The 
four protocols based on approximations of multi-step fast-charging 
protocols in the battery literature were obtained by determining the 
current ratios between various steps and translating those ratios to 
our ten-minute fast-charging space. The voltage limits were consistent 
with our charging protocols, that is, 2.0 V and 3.6 V. 

Five batteries per charging protocol were tested to obtain a rea- 
sonable estimate of the true cycle lives. In this experiment, one test 
returned a prediction with a prediction interval outside of the threshold 
(channel 12; 3.6C-6.0C-5.6C-4.755C) and was excluded. A comparison
of the three different methods for cycle life results (CLO, early predictions
from validation, and final measurements from validation) is
presented in Extended Data Fig. 7.


Validation ablation study 

For the ablation study using the charging protocols and data from the
validation experiments, we systematically compared the full closed-loop
system against three other ablation baselines which use (1) only
early prediction (no BO exploration-exploitation, purely random
exploration), (2) only BO exploration-exploitation (no early prediction),
and (3) purely random exploration without any early prediction. As
highlighted earlier, since the final cycle lives for the protocols in the
validation study have a noticeable bias that can be explained by calendar
ageing (Supplementary Discussion 4), we perform a simple additive
bias correction for each of the final cycle lives beforehand to suppress
any undesirable influence of this bias in interpreting the results.

We run the four ablation baselines for a varying number of sequential 
rounds. Since our validation space is relatively small (nine charging 
protocols, five batteries tested per protocol in our validation dataset), 
we run only one battery per round (that is, we assume a one-channel 
battery cycler). The baselines that use BO exploration-exploitation
additionally require hyperparameters to be specified before beginning
the experiment, as described in the Methods section ‘BO hyperparameter
optimization’. The best hyperparameters are chosen separately for
each round based on the performance obtained on the physics-based 
simulator, averaged over 100 random seeds. 

When an ablation baseline queries for the cycle life of a given charging 
protocol, the returned value corresponds to one of the five runs in our 
validation dataset, chosen via random sampling with replacement (that 
is, bootstrapped). The experimental time cost of this query is equal to 
100 cycles for ablation baselines that use early prediction and equals 
the full cycle life otherwise. Finally, to account for the randomness at 
the beginning of the experiment (that is, round 0 when every ablation
baseline randomly selects a protocol), we report the performance of 
each ablation baseline averaged over a sequence of 2,000 randomly 
initialized experiments. To specify the y-axis of Fig. 4e, we assume that 
each full cycle (charging, discharging, resting) requires one hour of 
experimental testing. 
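
The query step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the data values, protocol names and the `query` helper are hypothetical, and returning the early prediction (rather than the final cycle life) for baselines that use early prediction is an assumption about how such a baseline would be wired.

```python
import random

# Hypothetical validation data: protocol -> five runs of
# (early-predicted cycle life, final cycle life).
VALIDATION = {
    "protocol_A": [(900, 880), (950, 910), (870, 860), (920, 905), (940, 930)],
    "protocol_B": [(700, 720), (680, 690), (710, 735), (690, 700), (705, 715)],
}
EARLY_PREDICTION_CYCLES = 100  # cycles needed before an early prediction
HOURS_PER_CYCLE = 1.0          # one hour per full cycle, as assumed in the text


def query(protocol, use_early_prediction, rng):
    """Bootstrap one of the five runs; return (cycle life, time cost in hours)."""
    early, final = rng.choice(VALIDATION[protocol])  # sampling with replacement
    if use_early_prediction:
        return early, EARLY_PREDICTION_CYCLES * HOURS_PER_CYCLE
    return final, final * HOURS_PER_CYCLE


rng = random.Random(0)
print(query("protocol_A", True, rng))
print(query("protocol_A", False, rng))
```

Because each call draws independently from the same five runs, repeated queries of one protocol are sampled with replacement, and the time cost differs only by whether early prediction is used.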


Overpotential analysis 

To determine the dependence of overpotential on current and SOC 
during charging (Extended Data Fig. 2e, f), we perform a pseudo-galvanostatic
intermittent titration technique experiment on two minimally
cycled batteries and two degraded batteries (80% of nominal capacity 
remaining). We probe currents ranging from 3.6C to 8C at 20%, 40%, 
60% and 80% SOC, mirroring the current and SOC values used in charg- 
ing protocol design. In this experiment, we start at an initial SOC 20% 
lower than the target, for example, we start at 0% SOC to probe 20% 
SOC. We then charge at a given current rate, for example, 3.6C, until 
we reach 20% SOC. The cell rests for 1 h, and then the cell discharges at
1C back to 0% SOC. We repeat this sequence for all current values, after 
which we charge the cell at 1C to the next initial SOC, for example, 20% 
SOC to probe 40% SOC, and repeat for each SOC of interest. 

To compute the overpotential, we compare the voltage at the start 
and end of the 1-h rest periods. Nearly all of the potential drop occurs 
immediately (<100 ms) after the start of the rest period. Given the 
linear trends observed (implying ohmic-limited rate capability), we 
then perform a linear fit on each overpotential-current series. In these 
fits, the slope represents the ohmic resistance. 
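
The linear fit can be sketched with an ordinary least-squares line through overpotential versus current. The currents and overpotentials below are synthetic values generated from an assumed resistance of 33 mΩ purely for illustration; they are not measured data, and `linear_fit` is a hypothetical helper.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic series at one fixed SOC: currents in A, overpotentials in V.
currents = [3.96, 5.5, 6.6, 8.8]
overpotentials = [0.033 * i for i in currents]

resistance_ohm, intercept_v = linear_fit(currents, overpotentials)
print(resistance_ohm, intercept_v)
```

With current in amperes and overpotential in volts, the fitted slope has units of ohms, which is why it can be read directly as the ohmic resistance.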


Data availability 
The datasets used in this study are available at https://data.matr.io/1. 
Arbin-schedule-file-creation.
Extended Data Fig. 1 | Illustrations of early outcome predictor and BO
components of CLO. a, Illustration of early outcome prediction for two cells
(A and B) using data from only the first 100 cycles. Two discharge capacity
features are generated: the second-cycle discharge capacity, Q_{d,2}, and the
difference between the maximum and second-cycle discharge
capacities, max(Q_d) − Q_{d,2}. Three voltage features are generated: the logarithm
of the minimum, variance and the skewness of the difference in voltage curves
between cycles 100 and 10. These five features are combined in a linear model
to predict the final cycle life, or the number of cycles until the capacity falls
below 0.88 Ah. The weights and scalings of each feature are determined by
training the model on a training set using the elastic net; the weights and
scaling values are presented in Supplementary Table 1. See Severson et al. and
Methods for additional details. b, Illustration of the BO principle. The desired
output, cycle life, has a true functional dependence on charging protocol
parameters (such as CC1). Here, we show a one-dimensional model (that is, just
dependent on one parameter, CC1) for simplicity. By performing Gaussian
process regression on the available data, we develop a probabilistic estimate of 
Extended Data Fig. 2 | Cell characterization. a, b, Voltage versus capacity
during rate testing of A123 18650M1A cylindrical cells under charge (a) and
discharge (b). The (dis)charge step not under investigation is cycled at 1C to 
isolate the rate of each step; for example, the charge rate test is performed with 
1-C discharge steps. We note that the discharge rate capability is much higher 
than that of charge. c, d, Battery surface temperature (‘can temperature’) 
versus capacity during rate testing under charge (c) and discharge (d). The can 
epoxy. e, f, Overpotential as a function of SOC and C rate (see Methods
section ‘Overpotential analysis’ for details of the measurement) for a minimally
cycled cell (e) and an aged cell at 80% of nominal capacity (f). The trend lines are
linear fits of the overpotential as a function of current at fixed SOC (excluding
outliers). We note that both of the relationships are linear (indicating that the
rate capability is ohmically limited) and that the SOC dependence is weak,
particularly for the minimally cycled cell. The initial internal resistance,
averaged over two cells and all four SOCs, is 33 mΩ.
Extended Data Fig. 3 | Additional optimization results. a, b, Mean of the
absolute difference in CLO-estimated cycle lives with increasing rounds, 
expressed as both percentage change (a) and absolute change (b). These 
changes are relatively small beyond round 2, suggesting that the closed loop 
can perform well with even smaller time or battery budgets. c, Change in 
Kendall rank correlation coefficient with increasing rounds. From round 3 to 
round 4, the ranking of the top protocols shifts, but the cycle lives of these top 
protocols are similar. d, Distribution of CLO-estimated mean cycle lives after 
round 4. The mean and standard deviation are 943 cycles and 126 cycles, 
respectively. e, Correlation between CLO-estimated mean cycle lives and the 
sum of squared currents, a simplified measure of heat generation (P = I²R). This
relationship suggests that minimizing heat generation, as opposed to avoiding
lithium plating, may be the operative optimization strategy for these cells
under these conditions. f, Standard deviation (σ_{4,i}) versus mean (μ_{4,i}) of the BO
predictive distribution over cycle life after round 4. The standard deviation
quantifies the uncertainty in the cycle life estimates and is generally low for
protocols estimated to have high mean cycle life, since these protocols are
probed more frequently. We start with a relatively wide, flat prior (standard
deviation 164) and therefore the uncertainty intervals after four rounds are also
wide. g, Mean ± standard deviation of the predictive distribution over cycle life
after round 4 (μ_{4,i} ± σ_{4,i}) for all charging protocols, sorted by their rank after
round 4. The legend indicates the number of repetitions for each protocol
(excluding failed batteries).


Extended Data Fig. 4 | Means and upper/lower confidence bounds
(μ_{k,i} ± β_k σ_{k,i}) on cycle life per round k. Protocol indices on the x-axis are sorted
by rank after round 4. The weighted interval around the estimated mean,
β_k σ_{k,i} = (β_0 ε^k) σ_{k,i}, weights the protocol-specific standard deviation at round k,
σ_{k,i} (estimated by the Gaussian process model), with the exploration tradeoff
hyperparameter at round k, β_k. The upper and lower confidence bounds are
plotted for all charging protocols before round 1 (a) and after rounds 1 (b), 2 (c),
3 (d) and 4 (e). The predictive distributions for all charging protocols have
identical means and standard deviations before the first round of testing.
Because the standard deviations are weighted by β_k = β_0 ε^k and ε = 0.5, the
weighted confidence bounds rapidly decrease with increasing round number,
favouring exploitation (examination of protocols with high means). The BO
algorithm recommends the 48 protocols with the highest upper bounds (red
points); the upper bounds are high either due to high uncertainty (exploration)
or high means (exploitation). The algorithm rapidly shifts from exploration to
exploitation as ε^k rapidly shrinks the upper bounds with increasing round
index. We note that one protocol per round that should have been selected
(that is, with a top-48 upper bound) was not selected owing to a processing
error; instead, the protocol with the 49th-highest upper bound was selected.
Extended Data Fig. 6 | Selected protocols for validation. The three protocols
with the highest CLO-estimated mean cycle lives are shown in panels b, c and d.
The protocols shown in panels a, f, g and h are approximations of previously
proposed battery fast-charging protocols (Extended Data Table 1). The
remaining two protocols, shown in panels e and i, were selected to obtain a
representative sampling from the entire distribution of CLO-estimated cycle
lives. The annotations on each panel represent the cycle lives of each protocol
as estimated by CLO (‘CLO’), early outcome prediction from validation (‘Early
prediction’), and the final cycle lives from validation (‘Final’). In the
annotations, the errors represent the CLO-estimated standard deviation after
round 4 (σ_{4,i}) for the CLO-estimated cycle lives and the 95% confidence intervals
for the early-predicted and final cycle lives from validation (n = 5; n = 4 for the
early predictions of 3.6C-6.0C-5.6C-4.755C) (a).
Extended Data Fig. 7 | Validation ablation analysis. We perform pairwise
comparisons of the cycle lives of the nine validation protocols, as estimated
from three sources: closed-loop estimates after four rounds, early predictions
from the validation experiment and final cycle lives from the validation
experiment. Panels a-c compare closed-loop estimates to early predictions
from validation, panels d-f compare final cycle lives from validation to early
predictions from validation, and panels g-i compare final cycle lives from
validation to closed-loop estimates. The first column (a, d and g) compares
cycle lives averaged on a protocol basis; the second column (b, e and h)
compares cycle lives on a battery (cell) basis; and the third column (c, f and i)
compares the predicted ranking by cycle life via each method. Orange points
represent the top three CLO-estimated protocols, blue points represent
protocols inspired by the battery literature (Methods), and green points
represent protocols selected to sample the distribution of estimated cycle
lives. The error bars represent the 95% confidence intervals (n = 5; n = 4 for the
early predictions of 3.6C-6.0C-5.6C-4.755C). The Pearson correlation
coefficient and Kendall rank correlation coefficients are displayed for all
relevant cycle life and ranking plots, respectively.
Extended Data Fig. 8 | Closed-loop performance under resource constraints.
Comparison of the closed loop with and without the Bayesian optimization
algorithm (that is, with and without the explore/exploit component) as a
function of number of channels and number of rounds in the 224-protocol
space, using the first-principles simulator as the ground-truth source for cycle
lives. Early prediction is not included. Each point represents the mean of 100
simulations; error bars represent the 95% confidence intervals (n = 100). Early
prediction is not incorporated into these simulations. The complete closed
loop (that is, with Bayesian optimization) consistently outperforms the closed
loop without Bayesian optimization. Bayesian optimization offers the largest
advantage when the number of channels is low relative to the number of
protocols.
Extended Data Fig. 9 | Hyperparameter sensitivity analysis on a cycle life
simulator. The true cycle life of the best charging protocol as estimated by
CLO, averaged over ten random seeds, is plotted as a function of the initial
exploration constant (β_0), the exploration decay factor (ε) and the kernel
bandwidth (γ). The values of all other hyperparameters are consistent with the
values indicated in the ‘BO hyperparameter optimization’ Methods section and
in Supplementary Table 5. Overall, CLO achieves acceptable performance over
a range of hyperparameter combinations; the highest-cycle-life protocols as
estimated by the best and worst hyperparameter combinations differ by only
60 cycles. In our real-world CLO experiment, the selected hyperparameters are
β_0 = 5.0, ε = 0.5 and γ = 1; this combination performed well on a variety of
simulated parameter spaces and budgets.


Extended Data Table 1| Selected charging protocols for validation 
The columns represent the CLO-estimated mean cycle lives of each protocol, early predictions in the validation experiment and the final tested cycle lives. For the CLO-estimated cycle lives,
the errors represent the CLO-estimated standard deviation after round 4 (σ_{4,i}). For the early-predicted and final cycle lives from validation, the errors represent 95% confidence intervals (n = 5;
but n = 4 for the early predictions of 3.6C-6.0C-5.6C-4.755C). The two protocols without a source were selected to obtain a representative sampling from the distribution of CLO-estimated cycle 
lives. Literature fast-charging protocols are from refs. °°". 
Glycans have diverse physiological functions, ranging from energy storage and
structural integrity to cell signalling and the regulation of intracellular processes’. 


Although biomass-derived carbohydrates (such as D-glucose, D-xylose and D- 
galactose) are extracted on commercial scales, and serve as renewable chemical 
feedstocks and building blocks*’, there are hundreds of distinct monosaccharides 
that typically cannot be isolated from their natural sources and must instead be 
prepared through multistep chemical or enzymatic syntheses**. These ‘rare’ sugars 
feature prominently in bioactive natural products and pharmaceuticals, including 
antiviral, antibacterial, anticancer and cardiac drugs®’. Here we report the 
preparation of rare sugar isomers directly from biomass carbohydrates through site- 
selective epimerization reactions. Mechanistic studies establish that these reactions 
proceed under kinetic control, through sequential steps of hydrogen-atom 
abstraction and hydrogen-atom donation mediated by two distinct catalysts. This 
synthetic strategy provides concise and potentially extensive access to this valuable 
class of natural compounds. 


Simple structural and storage polymers including starch, cellulose, and 
hemicellulose are important sources of the monosaccharide feedstocks 
D-glucose, D-xylose, D-galactose, D-mannose and L-arabinose (Fig. 1a). 
Isomerization is a useful strategy for the synthesis of so-called ‘rare’ 
sugars (Fig. 1b) from biomass precursors; however, these processes 
remain challenging owing to the structural and stereochemical com- 
plexity of sugars. Chemical isomerization reactions (for example, the 
Lobry de Bruyn-Alberda van Ekenstein and Bilik reactions) are typi- 
cally unselective, leading to complex thermodynamic distributions 
of products and often intractable separations (Extended Data Fig. 1a)®.
In contrast, enzymatic methods offer an added level of precision and 
have emerged as a powerful synthetic alternative to chemical strate- 
gies’. Enzymatic isomerizations feature prominently in industrial sugar 
processing, including in the syntheses of D-fructose and D-ribose from 
D-glucose”. In principle, multistep enzymatic synthesis also provides 
synthetic access to rare hexose, pentose and tetrose isomers but low 
yields and prohibitive production costs nonetheless limit implementa- 
tion of these strategies”. For example, D-allose has potential value as 
a low-glycemic sweetener and shows promising anti-inflammatory and
immunosuppressive activity. The enzymatic synthesis of D-allose can 
be achieved in overall 2.5% yield from D-glucose through sequential 
treatment with D-xylose isomerase (50% yield), D-tagatose 3-epimerase 
(20% yield), and L-rhamnose isomerase (25% yield allose + 8% yield 
altrose) (Extended Data Fig. 1b). Like chemical isomerizations, nearly 
all enzymatic isomerizations proceed through reversible polar enoli- 
zation mechanisms under equilibrium control. Maximum product 
yields are therefore dictated by thermodynamic considerations under 
reaction conditions constrained by temperature-dependent enzyme 
activity. Reaction scope is also mechanistically restricted: for example, 
2-deoxygenated sugars cannot be substrates under enolization-based 
isomerization conditions. 
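
The overall yield quoted for the enzymatic route is simply the product of the individual step yields. The short check below makes that arithmetic explicit; the variable names are illustrative only.

```python
from math import prod

# Step yields quoted in the text: D-xylose isomerase, D-tagatose 3-epimerase,
# and L-rhamnose isomerase (allose fraction).
step_yields = [0.50, 0.20, 0.25]

overall_yield = prod(step_yields)
print(f"overall yield: {overall_yield:.1%}")
```

Because sequential yields multiply, a single low-yielding step dominates the overall yield of a multistep sequence.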


We envisioned that rare sugar isomers could be obtained directly 
from biomass-derived carbohydrates through site-selective radi- 
cal epimerization reactions. A kinetically controlled epimerization 
process would require C—H bond cleavage and C-H bond formation 
steps to proceed through distinct mechanisms but could in princi- 
ple afford product yields and selectivities exceeding those observed 
under canonical equilibrium-controlled sugar isomerization condi- 
tions. Our approach draws inspiration from recent reports of enzy- 
matic radical epimerizations mediated by the cofactor S-adenosyl 
methionine (SAM)'©””. For example, in the biosynthesis of neomycin 
B from neomycin C, a high-energy 5’-deoxyadenosyl radical abstracts
the C5 hydrogen atom from the terminal saccharide of neomycin
C (Fig. 1c). Subsequent re-delivery of a hydrogen atom to the opposite
face is achieved via a pendant cysteine thiol. Within the enzyme active 
site, diastereoselective hydrogen-atom transfer (HAT) steps are thus 
promoted through exquisite spatial control over the hydrogen-atom 
donor/acceptor pairs. The architecture of the enzyme further miti- 
gates any chemical incompatibility of the 5’-deoxyadenosyl radical and 
cysteine thiol. Outside the physical context of an enzyme, however, 
other interactions would be necessary to achieve kinetic control and 
prevent reagent donor/acceptor quenching. Although several notable 
synthetic HAT-mediated epimerization reactions have been developed, 
these methods almost universally proceed through reversible HAT 
steps, affording equilibrium product distributions’. However, a der- 
acemization reaction recently reported by Knowles has established the 
conceptual viability of contra-thermodynamic radical isomerization 
through sequential proton-coupled electron transfer and HAT steps”. 

In addition to kinetic challenges posed by employing chemically
incompatible reagents, the success of this transformation
further requires numerous nearly identical C-H bonds to be distinguished
within the context of a highly polar, densely functionalized,
stereochemically complex unprotected carbohydrate substrate.
Indeed, while site-selective functionalization of carbohydrate O-H
bonds has been the subject of considerable recent attention, examples
of selective carbohydrate C-H oxidation and functionalization are
extremely limited. Precedents set by the Minnaard, Waymouth and
Muramatsu laboratories have established the feasibility of site-selective
oxidation of minimally protected monosaccharides to keto-sugars
using Pd(II)/benzoquinone, Pd(II)/O₂ and Sn(IV)/Br₂ catalyst systems,
respectively. Recent work by Minnaard has demonstrated that diastereoselective
C-H alkylation of glucose derivatives can be achieved
through site-selective hydrogen-atom abstraction by a quinuclidinium
radical cation generated under photoredox conditions. Taylor has
further expanded the scope of this transformation by employing a
borinic acid co-catalyst to promote stereoretentive C-H alkylation of
cis-diol-containing monosaccharides. Building on these findings,
here we report a strategy for the synthesis of rare sugar isomers from
biomass-derived precursors through the site-selective epimerization
of unprotected sugars and glycans (Fig. 1d).

Fig. 1 | Approaches to the epimerization of sugars. a, Only a handful of
monosaccharides can be obtained from natural sources. b, Hundreds of rare
sugars exist and feature prominently in glycosylated natural products.
c, Radical SAM-dependent epimerization by NeoN of neomycin C involves
sequential HAT from substrate to the 5’-deoxyadenosyl radical, and from
cysteine thiol (Cys-249-SH) to the substrate. (Met, methionine.) d, Proposed
direct chemical epimerization of biomass sugars to the rare sugars described
here.

After extensive exploration of reaction conditions, the minimally
protected model substrate, α-methylglucose, was found to
react under photochemical conditions to afford α-methylallose
as the sole reaction product in 92% yield within 3 h (Fig. 2).
Optimal reaction conditions employ catalytic quantities of
1,2,3,5-tetrakis(carbazol-9-yl)-4,6-dicyanobenzene (4-CzIPN), quinuclidine,
adamantane thiol and tetrabutylammonium p-chlorobenzoate
in MeCN/DMSO at room temperature under blue light irradiation
(see Supplementary Fig. 1). No epimerization was observed to occur
in the absence of photocatalyst, thiol or light, and only trace prod- 
uct (<1%) was observed in the absence of quinuclidine. The reaction 
yield was considerably diminished in the absence of benzoate base 
(29% yield) or by employing tetrabutylammonium benzoate as the 
base (63% yield). Ir[(dF(CF;)ppy),.(dtbpy)]PF, (IrF) is also an effec- 
tive photoredox catalyst for this transformation, promoting the 
desired reaction at only 1 mol% loading. However, the observation of 
oxidation products, as well as considerable cost differences ($5 per 
mmol for 4-CzIPN versus $935 per mmol for IrF; dollar prices are in 
US$ in 2020) led us to select 4-CzIPN as the preferred reagent. Alkyl 
thiols were uniquely effective at promoting epimerization: no reac- 
tion was observed using thiophenols or thiobenzoic/thioacetic acid 
derivatives, nor when using any other hydrogen-atom donor surveyed 
(see Supplementary Information section 8). 

Although the basis for site-selectivity is not yet fully understood,
the C3 selectivity observed here is congruent with the substrate-controlled
selectivity previously noted by both Waymouth and Minnaard
in oxidation reactions. Nuclear magnetic resonance (NMR) titration
experiments reveal an equilibrium interaction between 1a and
tetrabutylammonium p-chlorobenzoate with alteration of substrate
¹J(C-H) coupling constants, implicating the presence of hydrogen-bonding
interactions between 1a and the base with concomitant weakening
of α-C-H bonds. In contrast, no such interaction is observed between
pentamethylated glucose and base, and importantly, no epimerization
products were detected in reactions employing permethylated
or peracetylated substrates.

Fig. 2 | Epimerization of α-methylglucose to α-methylallose. Effect of
changes to optimized reaction conditions. Yields determined by proton
nuclear magnetic resonance (¹H NMR) analysis using 4-fluoroanisole as
internal standard. RSM, recovered starting material; MeCN, acetonitrile;
DMSO, dimethylsulfoxide; RT, room temperature; LED, light-emitting diode;
Me, methyl; DABCO, 1,4-diazabicyclo[2.2.2]octane; Ad-SH, adamantane thiol;
Bz, benzoyl. *See Supplementary Information sections 5 and 8 for full
experimental details.

A range of biomass-derived and abundant monosaccharides were
evaluated as substrates under the optimized reaction conditions
(Fig. 3). This strategy provides synthetic access to 4 of the 5 rare hexose
isomers. In addition to D-allose products obtained from glucose-configured
substrates, D-talose (2b), D-gulose (2c) and D-altrose (2d) products
are obtained selectively from the reaction of β-methylgalactose,
anhydrogalactose and anhydromannose, respectively. The biomass-derived
pentose sugars D-α-methylxylose (1f) and L-β-methylarabinose (1e)
afforded D- and L-ribose derivatives through C3 and C2 epimerization,
respectively. Although D-ribose is accessible on a large scale through
glucose fermentation, synthetic access to L-ribose remains extremely
limited. Despite the presence of an electron-rich acetamide substituent,
the N-acetylglucosamine derivative 1g also undergoes productive
epimerization, reacting to afford a 1.5:1 mixture of the C3- and C4-epimeric
products, N-acetylallosamine, 2g, and N-acetylgalactosamine,
3g, respectively.

Completely unprotected monosaccharides also undergo selective
epimerization under these conditions: for example, 42% yield D-allose
2k is obtained from D-glucose (compared with a 2.5% total yield over
4 steps, using enzymatic methods), and 55% yield L-6-deoxytalose 2m
is obtained from the reaction of L-fucose. The reaction of D-2-deoxyglucose
affords D-2-deoxyallose in 61% yield: importantly, epimerization
of 2-deoxygenated sugars cannot be realized using alternative,
enolization-based isomerization methods.

More complex glycans were subsequently evaluated to assess fur- 
ther the selectivity and functional group compatibility of the reaction 
conditions. Allosucrose, 2o, can be obtained selectively from sucrose
in 68% yield, and despite the presence of 14 stereogenic centres, the
reaction of raffinose, 1p, affords a singly epimerized product, 2p, in 44%
yield. Pyrimidine-containing pyranonucleoside, 1q, reacted to afford 
the C3-epimeric product, 2q. Finally, the C-glycoside SGLT2 inhibi- 
tor empagliflozin was also examined as a substrate, and was found to 
afford the C3-epimeric product, 2r, in 42% yield with no observation 
of epimerization at any other position in the molecule. 

We performed a series of experiments to gather insight into the 
underlying mechanism of this transformation. Re-subjecting
6-deoxy-β-methyltalose, 2h, (formed in 70% yield from β-methylfucose, 1h) to
standard reaction conditions did not result in the formation of any
β-methylfucose starting material (Fig. 4a). A similar experiment was
conducted using α-methylallose and adamantane thiol-d1 (where
‘d1’ indicates deuterium isotopic substitution). After 16 hours, 95%
α-methylallose was recovered with 39% deuterium incorporation at
the C3 position; no glucose products were detected. This experiment
demonstrates that both α-methylglucose and α-methylallose can
undergo hydrogen-atom abstraction but that both converge to the
α-methylallose product (Fig. 4b). Taken in conjunction with established
thermochemical data, these experiments provide preliminary
evidence that these transformations do not proceed under simple
equilibrium control.

To explore further the individual elementary steps of this reaction, 
Stern-Volmer fluorescence quenching was examined under two dif- 
ferent sets of conditions (see Supplementary Information section 9). 
Preliminary experiments reveal that quinuclidine efficiently quenches 
the photocatalyst excited state, whereas adamantane thiol does not 
quench under the conditions examined. In the presence of both quinu- 
clidine and adamantane thiol, quenching kinetics are identical to the 
‘quinuclidine only’ conditions. These findings support a mechanism in 
which the excited photocatalyst is quenched by quinuclidine to generate
quinuclidinium radical cation. When catalyst loading is increased
to 3 mol% and the reaction is run in the absence of thiol, small quantities
(3% yield) of 3-keto sugar are obtained (Fig. 4c). Importantly, no
epimerization is observed under these conditions, nor under any conditions
tested where alkyl thiol is not present in the reaction mixture (see
Supplementary Information section 9). These experiments establish 
that photocatalyst and quinuclidine are sufficient for C-H cleavage 
to occur but insufficient for epimerization. Together, they support 
a mechanism in which the quinuclidinium radical cation mediates an 
irreversible hydrogen-atom abstraction step. 
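
The quenching comparison above can be illustrated numerically. The following is a minimal sketch, not the authors' analysis: it assumes the standard Stern-Volmer relation I0/I = 1 + K_SV[Q], and the function name `stern_volmer_constant` and all concentration and intensity values are hypothetical.

```python
def stern_volmer_constant(quencher_conc, intensity_ratio):
    """Least-squares slope K_SV for I0/I = 1 + K_SV*[Q] (intercept fixed at 1).

    quencher_conc   -- quencher concentrations [Q] (M)
    intensity_ratio -- measured I0/I at each concentration
    """
    num = sum(q * (r - 1.0) for q, r in zip(quencher_conc, intensity_ratio))
    den = sum(q * q for q in quencher_conc)
    if den == 0:
        raise ValueError("need at least one non-zero quencher concentration")
    return num / den


# Hypothetical data (illustrative only, not taken from the paper).
conc = [0.0, 0.01, 0.02, 0.04]           # M
ratio_quencher = [1.0, 1.5, 2.0, 3.0]    # an efficient quencher
ratio_inert = [1.0, 1.0, 1.0, 1.0]       # a species that does not quench

k_quencher = stern_volmer_constant(conc, ratio_quencher)
k_inert = stern_volmer_constant(conc, ratio_inert)
```

In this picture, a species that does not quench gives a slope near zero, and adding it to an efficient quencher leaves the slope unchanged, which is the pattern reported for adamantane thiol and quinuclidine.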

To probe the role of the thiol co-catalyst, an analogous set of fluorescence
quenching experiments was carried out using 4-bromothiophenol
in the place of adamantane thiol. Under these reaction conditions
no epimerization or consumption of α-methylglucose is observed. As
with adamantane thiol, minimal fluorescence quenching was observed 
using 4-bromothiophenol alone. However, in the presence of both 
4-bromothiophenol and quinuclidine, a substantial increase in fluores- 
cence quenching was observed relative to the case with quinuclidine 
alone (see Supplementary Information section 9). We postulated that 
this enhanced fluorescence quenching might be due to the oxidation of 
thiolate—generated in situ by deprotonation of thiol by quinuclidine, 
or through proton-coupled electron transfer from a quinuclidine/thiol 
hydrogen-bonded complex—to the corresponding thiyl radical. Indeed, 
sodium thiophenolate was also found to quench the photocatalyst at a
tenfold-higher rate than quinuclidine alone, and NMR titration studies 
performed under comparable conditions identified a small equilibrium 

Fig. 3 | Site-selective epimerization of saccharides and glycans. Scope of the
reaction with respect to substrate. Reactions were conducted at 0.3-mmol
scale in duplicate; the average ¹H NMR yield is reported; duplicate reactions
were combined, then isolated; isolated yields are in parentheses. *, Isolated in 


406 | Nature | Vol 578 | 20 February 2020


[Fig. 4 reaction schemes, panels a-d. Standard conditions: 2 mol% 4-CzIPN,
10 mol% quinuclidine, 50 mol% Ad-SH (Ad-SD in b; no thiol in c),
25 mol% 4-ClOBzBu4N, MeCN/DMSO, RT, 16 h, blue LED.
a, 2h (L-6-deoxy-β-methyltalose): 83% recovered; 1h not observed.
b, 2a: 95% recovered, 39% deuterium incorporated; 1a not observed.
c, 1a: 97% recovered; 3-keto sugar, 3% yield; 2a not observed.
d, Catalytic cycle: hydrogen-atom abstraction by R3N•+ and hydrogen-atom
donation by RSH, converting D-glucose to D-allose.]

observed when reaction products are re-subjected to the standard reaction

interaction (Keq = 42 M⁻¹) between 4-bromothiophenol (pKa = 9.0 in
DMSO) and quinuclidine (quinuclidinium conjugate acid, pKa = 9.8 in
DMSO), supporting the formation of thiolate in solution.
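
As a rough check on why thiolate formation is plausible, the two pKa values quoted above can be converted into an equilibrium constant for proton transfer from thiol to quinuclidine. This is a back-of-the-envelope sketch that assumes ideal behaviour of free ions in DMSO (no ion pairing or hydrogen-bonded complexes); the function name is hypothetical, and the result is a different quantity from the 42 M⁻¹ association constant measured by NMR titration.

```python
def proton_transfer_constant(pka_acid, pka_conjugate_acid_of_base):
    """K for HA + B <=> A- + BH+, computed as 10**(pKa(BH+) - pKa(HA))."""
    return 10 ** (pka_conjugate_acid_of_base - pka_acid)


# pKa values in DMSO as quoted in the text.
K = proton_transfer_constant(pka_acid=9.0, pka_conjugate_acid_of_base=9.8)
print(f"K = {K:.1f}")  # K = 6.3
```

A constant above 1 indicates that deprotonation of 4-bromothiophenol by quinuclidine is mildly favourable under these assumptions.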

These experiments indicate that thiol acidity is an important parameter
distinguishing productive versus unproductive reaction conditions.
In the presence of quinuclidine, acidic thiols can be deprotonated
to form thiolate salts. Preferential thiolate quenching of the photo- 
catalyst results in the formation of thiyl radicals. No epimerization was 
observed under these, or any other photo-oxidative, photo-reductive 
and thermal conditions that we explored for the in situ generation of 
thiyl radical (see Supplementary Information section 9). These find- 
ings suggest that thiyl radical is not competent for hydrogen-atom 
abstraction. Accordingly, the thiol co-catalyst is instead implicated in 
a subsequent, irreversible HAT to the incipient sugar radical. 

Collectively, the mechanistic studies presented here support 
a non-equilibrium epimerization mechanism proceeding through
two sequential and distinct HAT steps: hydrogen-atom abstraction
by quinuclidinium radical cation (homolytic bond dissociation
enthalpy = 100 kcal mol⁻¹) from substrate, followed by HAT from thiol
(87 kcal mol⁻¹) to the incipient sugar radical (Fig. 4d). Although both
substrate and product can undergo hydrogen-atom abstraction by 
quinuclidinium radical cation, mechanistic data are consistent with 
irreversible hydrogen-atom abstraction followed by diastereoselective 
HAT from thiol. Attendant to a kinetically controlled epimerization 
mechanism, the reaction yields and product selectivities presented 
here exceed nearly all other direct isomerization yields reported so far, 
which have reflected thermodynamic product distributions. 


common product. c, Reaction in the absence of thiol donor affords no
isomerization product, implicating irreversible hydrogen-atom abstraction by
the quinuclidinium radical cation. d, Plausible mechanistic pathway for selective
isomerization reactions.

Atmospheric methane (CH₄) is a potent greenhouse gas, and its mole fraction has
more than doubled since the preindustrial era. Fossil fuel extraction and use are
among the largest anthropogenic sources of CH₄ emissions, but the precise
magnitude of these contributions is a subject of debate. Carbon-14 in CH₄ (¹⁴CH₄)
can be used to distinguish between fossil (¹⁴C-free) CH₄ emissions and
contemporaneous biogenic sources; however, poorly constrained direct ¹⁴CH₄
emissions from nuclear reactors have complicated this approach since the middle of
the 20th century. Moreover, the partitioning of total fossil CH₄ emissions (presently
172 to 195 teragrams CH₄ per year) between anthropogenic and natural geological
sources (such as seeps and mud volcanoes) is under debate; emission inventories
suggest that the latter account for about 40 to 60 teragrams CH₄ per year.
Geological emissions were less than 15.4 teragrams CH₄ per year at the end of the
Pleistocene, about 11,600 years ago, but that period is an imperfect analogue for
present-day emissions owing to the large terrestrial ice sheet cover, lower sea level
and extensive permafrost. Here we use preindustrial-era ice core ¹⁴CH₄ measurements
to show that natural geological CH₄ emissions to the atmosphere were about 1.6
teragrams CH₄ per year, with a maximum of 5.4 teragrams CH₄ per year (95 per cent
confidence limit), an order of magnitude lower than the currently used estimates.
This result indicates that anthropogenic fossil CH₄ emissions are underestimated by
about 38 to 58 teragrams CH₄ per year, or about 25 to 40 per cent of recent estimates.
Our record highlights the human impact on the atmosphere and climate, provides a
firm target for inventories of the global CH₄ budget, and will help to inform strategies
for targeted emission reductions.


Atmospheric measurements of carbon-13 in methane (δ¹³CH₄) have
been used to estimate the fossil fraction of the contemporaneous CH₄
budget. This approach relies on having accurate estimates of the δ¹³C
signatures of the major CH₄ source categories (fossil, microbial and
biomass burning) and the strength of the biomass burning source.
Large uncertainties in these parameters in the past preclude accurate
δ¹³CH₄-based estimates of preindustrial-era fossil CH₄ emissions.
Radiocarbon (¹⁴C) is an ideal tracer for quantifying the fossil component
of the atmospheric CH₄ budget because all ¹⁴C in fossil CH₄ has decayed.
By contrast, biogenic CH₄ sources (wetlands, biomass burning) have a
¹⁴C activity similar to that of contemporaneous atmospheric CO₂ (ref.
48). Interpretation of atmospheric ¹⁴CH₄ measurements from 1987-2000
suggests that the fossil fraction of the contemporary CH₄ budget is
30 ± 2.3% (ref. 8; 1σ). However, the interpretation of atmospheric ¹⁴CH₄
in recent decades has been complicated by (1) rapidly changing atmospheric
¹⁴CO₂ (from above-ground nuclear testing and fossil fuel emissions)
that propagates into biospheric CH₄ emissions, and (2) direct
¹⁴CH₄ emissions from nuclear power plants. By contrast, palaeoatmospheric
¹⁴CH₄ measurements from ice cores offer a direct constraint on
natural geological CH₄ emissions without these complications. Whereas
geological CH₄ emissions have the potential to change on tectonic- and
glacial-cycle timescales, they have very probably been constant over
the past few centuries. The preindustrial-era emission estimates can
therefore be applied to the modern CH₄ budget with confidence.

Ice core ¹⁴CH₄ analysis is challenging owing to both the very large
sample requirement (~1,000 kg of ice) and interference from in situ
cosmogenic ¹⁴C production within the ice crystal lattice. We address
the former by using a large-diameter ice drill and a large-volume
ice-melting apparatus (Supplementary Information section 1) to obtain
sufficient CH₄ (~20 μg C) for ¹⁴C analysis by accelerator mass
spectrometry. To address the latter, we follow the established approach
of analysing ¹⁴C of carbon monoxide (CO) in parallel with ¹⁴CH₄. ¹⁴CO is
very sensitive to in situ cosmogenic ¹⁴C production and can be used
to precisely establish the effective cosmic ray exposure history of each
sample. We then correct the ¹⁴CH₄ data using the known in situ
cosmogenic ¹⁴CH₄/¹⁴CO production ratio in ice (Supplementary Information
sections 5, 6). The in situ cosmogenic ¹⁴CH₄ component in the samples
used in this study is much lower (<2% of total ¹⁴CH₄) than in
ablation-zone ice used in previous palaeoatmospheric ¹⁴CH₄ studies (~30% of
total ¹⁴CH₄). We present new ¹⁴CH₄ data from large-volume ice core
samples and firn air sampling from Summit, Greenland, which we
combine with prior firn air ¹⁴CH₄ measurements from Law Dome DSSW20K
and Megadunes, Antarctica. Our combined record spans from about
1750 to 2013 and captures the evolution of atmospheric ¹⁴CH₄ since the
preindustrial era (Fig. 1). The movement of gases within the firn and
closure into bubbles is characterized using a firn air transport model,
and the time series of atmospheric ¹⁴CH₄ is reconstructed using a matrix
inversion technique (Supplementary Information section 9).

Our atmospheric ¹⁴CH₄ reconstruction (Fig. 1) is indistinguishable
from the ¹⁴CO₂-derived contemporaneous biogenic ¹⁴CH₄ signature
(blue curve, Supplementary Information section 10) before 1880,
suggesting very low natural geological CH₄ emissions. Atmospheric ¹⁴CH₄
began to decrease around 1880, coincident with substantial increases
in the use of coal, oil and natural gas (Fig. 2). The precise timing of
the ¹⁴CH₄ minimum (in the 1940s in our reconstruction) is difficult to
establish owing to the broad age distributions of individual firn air
and ice core samples, as well as the smoothing applied by the matrix
inversion technique to address the non-uniqueness of the solution.
Beyond this fossil ¹⁴C minimum, our samples are affected by the
propagation of ¹⁴C from atmospheric nuclear testing into the carbon cycle
and by emissions from nuclear power plants (starting in the 1970s),
which drove a sustained ¹⁴CH₄ increase despite decreasing ¹⁴CO₂. We
calculate the fossil CH₄ fraction and develop a time series of fossil CH₄
emissions (Fig. 2) using a one-box atmospheric model (Supplementary
Information section 10). The broad age distributions of our air samples
(Supplementary Fig. 3) result in a smoothed representation of the
atmospheric ¹⁴CH₄ history that cannot capture the abrupt increase
of bomb ¹⁴CO₂ (and subsequently ¹⁴CH₄) starting in 1955. Therefore,
we interpret the fossil CH₄ fraction only before the 1940s. We find an
increase in the total (geological plus anthropogenic) fossil emissions
from negligible CH₄ emissions in the mid-19th century to 64.8
teragrams CH₄ per year (Tg CH₄ yr⁻¹) in 1940.

Assuming that the oldest ice core ¹⁴CH₄ sample in our reconstruction
(mean age 1756 AD; Fig. 1) is devoid of anthropogenic fossil CH₄
contributions, we use the contemporaneous biogenic ¹⁴CH₄ source
signature to calculate the natural geological CH₄ emissions during
the preindustrial era: 1.6 Tg CH₄ yr⁻¹ with a 95% confidence interval (CI)
maximum of 5.4 Tg CH₄ yr⁻¹ (Supplementary Information section 10,
Supplementary Fig. 5). Our 95% confidence limit of 5.4 Tg CH₄ yr⁻¹ agrees
well with, and provides a tighter constraint than, the only other
published ¹⁴CH₄-based estimate of natural geological CH₄ emissions from
ice cores, which sampled air from the most recent deglaciation (0 to
15.4 Tg CH₄ yr⁻¹, 95% CI range).

Our result is much lower than estimates from recent source inventory
(‘bottom-up’) studies typically used in global CH₄ budgets, which
suggest natural geological emissions of ~40-60 Tg CH₄ yr⁻¹ (ref. °). A
recent study aimed at developing gridded maps of geological CH₄
emissions revised this estimate downwards to 37 Tg CH₄ yr⁻¹ on the
basis of data and modelling specifically targeted for gridding; however,
the CH₄ emissions increased to 43-50 Tg CH₄ yr⁻¹ when extrapolated to
account for temporal variability in mud volcano eruptions and onshore

Fig. 1 | Reconstruction of atmospheric ¹⁴CH₄ from firn air and ice core data.

a, Global CH₄ mole fraction, [CH₄], reconstructed from ice core, firn air and
atmospheric measurements. ppb, parts per billion. b, Reconstructed history
of atmospheric Δ¹⁴CH₄ from firn air and ice core samples (this study). Dotted
lines represent the 95% confidence range based on all calculated ¹⁴CH₄ histories
using three different inversion methods (Supplementary Information
section 9). Ice core and firn air Δ¹⁴CH₄ measurements are shown at the mean age
of the modelled air age distribution. Vertical error bars on the Δ¹⁴CH₄ data from
each site represent the 2σ uncertainty for each sample after corrections
(Supplementary Information Tables 2, 6), and horizontal error bars represent
±2Δ, where Δ is the spectral width of the sample-air age distribution. We also
plot the ¹⁴CH₄ signature of the contemporaneous biogenic source (blue;
Supplementary Information section 10). Our time series begins in 1850
because the age distributions of the collected ice core samples have poor
coverage of air from ~1780 to 1850 (Supplementary Information section 10,
Supplementary Fig. 3B).


or submarine geological seeps that lack location-specific measurements.
Natural fossil CH₄ emissions of about 40 Tg CH₄ yr⁻¹ (out of total
preindustrial-era CH₄ emissions of 215 Tg CH₄ yr⁻¹; Supplementary Fig. 5)
would result in a preindustrial-era Δ¹⁴CH₄ of around −185‰, which is in
clear disagreement with our data (1.5‰ ± 21.2‰, 2σ; Fig. 1). Bringing
our ¹⁴C results into agreement with the bottom-up estimates of natural
fossil CH₄ emissions would require an order-of-magnitude larger
correction for in situ cosmogenic ¹⁴CH₄. This would in turn require either
an order-of-magnitude higher ¹⁴CO content in the sampled ice or an
order-of-magnitude higher in situ ¹⁴CH₄/¹⁴CO production ratio; both
of these possibilities are well outside the respective uncertainties. The
added uncertainties arising from the in situ and procedural corrections
to the measured ¹⁴CH₄ are also too small to explain the disagreement

Fig. 2 | Growth in fossil CH₄ emissions and fossil fuel consumption.

a, Historical fossil fuel energy consumption. b, Calculated total fossil CH₄
emission history (solid line) from the one-box model (Supplementary
Information section 10). The dashed lines show the 95% confidence interval.


(Supplementary Information section 10, Supplementary Information 
Table 8). 
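
The size of the expected ¹⁴C depletion quoted above can be approximated with a two-end-member mixing calculation. The sketch below is illustrative only and is not the one-box model of the Supplementary Information: it assumes linear mixing in Δ notation, a fossil end member of −1000‰ (¹⁴C-free), and a biogenic end member set equal to the measured preindustrial value of 1.5‰ (an assumption made here for illustration). The function name is hypothetical.

```python
FOSSIL_D14C = -1000.0  # per mil; 14C-free carbon


def mixed_d14c(fossil_emission, total_emission, biogenic_d14c):
    """Approximate Delta14C (per mil) of a source mix by linear mixing.

    fossil_emission, total_emission -- emissions in the same units (Tg CH4/yr)
    biogenic_d14c                   -- assumed biogenic end member (per mil)
    """
    f = fossil_emission / total_emission  # fossil fraction of the source
    return f * FOSSIL_D14C + (1.0 - f) * biogenic_d14c


# 40 Tg/yr fossil out of 215 Tg/yr total; biogenic value assumed to be 1.5 per mil.
bottom_up_case = mixed_d14c(40.0, 215.0, biogenic_d14c=1.5)
# Same calculation with the 1.6 Tg/yr estimate from this study.
this_study_case = mixed_d14c(1.6, 215.0, biogenic_d14c=1.5)
print(round(bottom_up_case), round(this_study_case))
```

Under these assumptions a 40 Tg CH₄ yr⁻¹ geological source lowers the expected value by roughly 185‰, far outside the ±21.2‰ (2σ) uncertainty, whereas a 1.6 Tg CH₄ yr⁻¹ source shifts it by only a few per mil.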

Diffuse microseepage (24 Tg CH₄ yr⁻¹), macro-seeps and mud
volcanoes (8.1 Tg CH₄ yr⁻¹), submarine seepage (>7 Tg CH₄ yr⁻¹) and
geothermal manifestations (5.7 Tg CH₄ yr⁻¹) represent the main categories
of natural geological CH₄ emissions in the latest comprehensive
bottom-up analysis. Each of these four categories is nearly equivalent
to, or exceeds, our upper bound (at 95% confidence) on the total
preindustrial-era geological CH₄ emissions (5.4 Tg CH₄ yr⁻¹). Emission
estimates for diffuse microseepage are based on limited flux-chamber
measurements in regions of known gas seepage (for example, ref. 7°),
which are scaled up to a global flux estimate based on the total dryland
area situated above hydrocarbon reservoirs (~10% of Earth’s total land
surface area), the percentage of measurements that show a positive
flux, and emission rates chosen on the basis of several geological
factors. It is possible that the uncertainties associated with such global
upscaling are much larger than reported, resulting in an overestimation
by an order of magnitude or more. Similarly, emission estimates
from macro-seeps, mud volcanoes and geothermal manifestations
are derived from limited observations, which are scaled up to a global
total. To provide a sense of scale for the extrapolation in the case
of mud volcanoes, ~0.0026 Tg CH₄ yr⁻¹ of measured CH₄ emissions
(table S2 in the supplement of ref. ”) are scaled up to 6.1 Tg CH₄ yr⁻¹
(table 2 in ref. ’).


With regard to submarine seepage, recent studies suggest that CH₄
emissions to the atmosphere are probably very low owing to rapid
dissolution of rising bubbles and rapid oxidation of dissolved CH₄
(ref. 5). Furthermore, ¹⁴CH₄ measurements in surface waters indicate
minimal quantities of fossil CH₄ even in shallow waters over areas of
active seeps or methane hydrate degradation. Our atmospheric
¹⁴CH₄ measurements for the preindustrial era indicate that either
(1) the uncertainties associated with global upscaling of geological
emissions from discrete measurements result in overestimation by
an order of magnitude, or (2) geological CH₄ emissions quantified
by these measurements were not present in the preindustrial era
and may have been triggered by fossil fuel extraction from hydrocarbon
reservoirs or other anthropogenic activity such as groundwater
aquifer depletion. If the latter is true, such emissions cannot
be considered natural.

A recent study used ice core δ¹³CH₄ measurements to arrive at a
natural geological CH₄ emission estimate that is on par with what is
indicated by bottom-up methods (~50 Tg CH₄ yr⁻¹). However, ref. ®
showed that ice core δ¹³CH₄ data do not provide a strong constraint
on preindustrial-era geological emissions and are also compatible
with a minimal geological source. Measurements of ethane in ice
cores have also been used to suggest considerable emissions of fossil
CH₄ during the preindustrial era. However, this is also an ambiguous
constraint because ice core measurements of ethane mole fraction
cannot discriminate between contributions from biomass burning (a
major source) and natural geological emissions. Our preindustrial-era
¹⁴CH₄ measurements, by contrast, place an unambiguous constraint on
natural fossil CH₄ emissions by directly recording the ¹⁴C-free fraction
of atmospheric CH₄.

Our ¹⁴CH₄ reconstruction does not allow accurate quantification of
the post-1950 fossil CH₄ budget, owing to relatively poor constraints
on the interfering nuclear ¹⁴CH₄ sources. Previous work used atmospheric
δ¹³CH₄ measurements to quantify the global fossil CH₄ source
in recent decades, but relied on inventory-based assessments to
constrain the natural geological component. We combine our ¹⁴CH₄-based
constraint on natural geological emissions (1.6 Tg CH₄ yr⁻¹)
with δ¹³CH₄-based constraints on the total fossil source (following
the same one-box model approach as ref. *; Supplementary Information
section 11) to estimate recent anthropogenic fossil CH₄ emissions.
This approach yields 177 ± 37 Tg CH₄ yr⁻¹ (1σ) for anthropogenic
fossil CH₄ emissions during 2003-2012. Our estimate is 22% higher
than the previous estimate of 145 ± 23 Tg CH₄ yr⁻¹ (1σ) over the same
interval, and 33-55% higher than the range of bottom-up estimates
(114-133 Tg CH₄ yr⁻¹; ref. 7). We note that our δ¹³CH₄-based calculation
uses an updated value for the CH₄ sink isotopic fractionation
(Supplementary Information section 11); if we use the same value as
ref. °, the anthropogenic fossil source estimate is 194 ± 34 Tg CH₄ yr⁻¹
for the same time period.
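
The percentage comparisons in this paragraph follow directly from the central values quoted. A short sketch (the helper name `percent_higher` is hypothetical, and uncertainties are ignored) reproduces them:

```python
def percent_higher(new, reference):
    """Percentage by which `new` exceeds `reference`."""
    return 100.0 * (new - reference) / reference


this_study = 177.0               # Tg CH4/yr, anthropogenic fossil, 2003-2012
previous = 145.0                 # Tg CH4/yr, earlier isotope-based estimate
bottom_up = (114.0, 133.0)       # Tg CH4/yr, range of inventory estimates

vs_previous = percent_higher(this_study, previous)
vs_bottom_up = (percent_higher(this_study, bottom_up[1]),
                percent_higher(this_study, bottom_up[0]))

print(round(vs_previous))                              # 22
print(round(vs_bottom_up[0]), round(vs_bottom_up[1]))  # 33 55
```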

Our results indicate that bottom-up inventories strongly underestimate
CH₄ emissions from fossil fuel extraction, distribution and use.
A study using both ground-based facility-scale measurements and
verification from aircraft sampling found that US oil and natural-gas
CH₄ emissions (largely from the production and gathering industry
segments) are ~60% higher than those reported by the US Environmental
Protection Agency, one of the primary data sources used in
bottom-up inventories. If we consider a scenario in which the global
bottom-up emissions of fossil CH₄ from the oil and natural-gas
industries (79 Tg CH₄ yr⁻¹; ref. *) are similarly underreported by 60%, this
would amount to unreported emissions of ~47 Tg CH₄ yr⁻¹, which is in
agreement with the fossil CH₄ emission shortfall that we identify in the
current generation of bottom-up inventories (44-63 Tg CH₄ yr⁻¹). Our
results imply that anthropogenic fossil CH₄ emissions now account
for about 30% of the global CH₄ source and for nearly half of
anthropogenic emissions, highlighting the critical role of emission reductions
in mitigating climate change.
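
The underreporting scenario and the inventory shortfall can be checked with simple arithmetic. The sketch below uses only central values quoted in the text and reads "underreported by 60%" as true emissions being 60% higher than reported; the variable names are hypothetical.

```python
reported_oil_gas = 79.0        # Tg CH4/yr, global bottom-up oil and gas emissions
excess_fraction = 0.60         # true emissions assumed 60% higher than reported

# Emissions missing from inventories under this scenario.
unreported = reported_oil_gas * excess_fraction

this_study = 177.0             # Tg CH4/yr, anthropogenic fossil (this study)
bottom_up = (114.0, 133.0)     # Tg CH4/yr, range of inventory estimates

# Shortfall of the inventories relative to the isotope-based estimate.
shortfall = (this_study - bottom_up[1], this_study - bottom_up[0])

print(round(unreported), shortfall)  # 47 (44.0, 63.0)
```

The ~47 Tg CH₄ yr⁻¹ from the scenario falls inside the 44-63 Tg CH₄ yr⁻¹ shortfall, which is the agreement the text refers to.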

The mammalian claustrum, owing to its widespread connectivity with other forebrain
structures, has been hypothesized to mediate functions that range from
decision-making to consciousness. Here we report that a homologue of the claustrum,
identified by single-cell transcriptomics and viral tracing of connectivity, also exists in
a reptile, the Australian bearded dragon Pogona vitticeps. In Pogona, the claustrum
underlies the generation of sharp waves during slow-wave sleep. The sharp waves,
together with superimposed high-frequency ripples, propagate to the entire
neighbouring pallial dorsal ventricular ridge (DVR). Unilateral or bilateral lesions of
the claustrum suppress the production of sharp-wave ripples during slow-wave sleep
in a unilateral or bilateral manner, respectively, but do not affect the regular and
rapidly alternating sleep rhythm that is characteristic of sleep in this species. The
claustrum is thus not involved in the generation of the sleep rhythm itself. Tract
tracing revealed that the reptilian claustrum projects widely to a variety of forebrain
areas, including the cortex, and that it receives converging inputs from, among
others, areas of the mid- and hindbrain that are known to be involved in wake-sleep
control in mammals. Periodically modulating the concentration of serotonin in the
claustrum, for example, caused a matching modulation of sharp-wave production
there and in the neighbouring DVR. Using transcriptomic approaches, we also
identified a claustrum in the turtle Trachemys scripta, a distant reptilian relative of
lizards. The claustrum is therefore an ancient structure that was probably already
present in the brain of the common vertebrate ancestor of reptiles and mammals. It
may have an important role in the control of brain states owing to the ascending input
it receives from the mid- and hindbrain, its widespread projections to the forebrain
and its role in sharp-wave generation during slow-wave sleep.


Slow-wave sleep and rapid-eye-movement (REM) sleep are the two main
macroscopic components of electrophysiological sleep in mammals
and birds, although some mammals may lack REM sleep. The recent
finding of alternating slow-wave and REM sleep in a reptile, the
Australian bearded dragon Pogona vitticeps, suggests that these two modes
of sleep may predate the diversification of amniotes 320 million years
ago. Sleep in Pogona is particularly interesting because the sleep cycle
of this reptile is very short (3 minutes or less at room temperature), and
is divided equally into slow-wave sleep and REM sleep.

The dominant electrophysiological feature of Pogona slow-wave sleep
is energy in the δ frequency band (around 0-4 Hz), which is caused by the
reliable occurrence of sharp waves. Sharp waves typically contain a
high-frequency ripple, forming a sharp-wave ripple complex (SWR). SWRs
were recorded from the DVR, the dominant non-cortical pallial domain
of sauropsid brains. REM sleep, by contrast, is characterized by
broadband energy, measured in the β band (10-40 Hz) in the cortex and DVR.


Origin of sharp waves during slow-wave sleep 


SWRs occurred reliably in the DVR during slow-wave sleep, and
slow-wave sleep alternated regularly with REM sleep (Fig. 1a-c, Extended
Data Fig. 1), as reported previously. High-frequency ripples (around
70-150 Hz) rode on each sharp wave and contained action potentials.
Local field potentials (LFPs) were highly correlated across DVR
recording sites (peak correlation 0.74 over 18 h of slow-wave sleep, mean over
two animals), but sharp waves that were recorded in the anterior medial
pole of the DVR (amDVR) preceded their counterparts in more
posterior or more lateral regions by up to 200 ms depending on the spacing
between recording sites (Fig. 1d, e, Extended Data Fig. 1g, h), suggesting
SWR propagation.

We next recorded from thick anterior transverse, horizontal and para- 
sagittal slices of DVR in artificial cerebrospinal fluid solution (ACSF) 
(Methods, Extended Data Fig. 2a-f). All configurations produced 

Fig. 1 | SWRs originate in the amDVR in sleeping Pogona. a, Simultaneous
recordings from two sites in the DVR (subcortical). A, anterior; P, posterior.
Scale bar, 1 mm. b, Auto- and cross-correlations of the δ/β power ratio as a
function of time from sites in a, calculated over 8 h of sleep. Coloured strips
show the δ/β ratio over one single 1,000-s stretch of sleep. c, Short segment of
the data that were analysed in b (same trace colours). Bottom left,
magnification of a short segment of slow-wave sleep, illustrating SWR
coordination and anterior-posterior delay. Bottom right, detail of a SWR and
high-pass components (middle and top, black). d, Cross-correlation between
broadband LFP waveforms (c) during 3.42 h of SWS. Reference trace is the
anterior recording site: the anterior site leads. e, Delay distribution of sharp
waves in the anterior (or posterior) DVR triggered on simultaneously recorded
posterior (or anterior) DVR. See Methods and Extended Data Fig. 1.


spontaneous SWRs that matched those produced in sleep: a biphasic
waveform (119 ± 40 ms) with a ripple (around 70-150 Hz) on the
trough. SWRs in DVR slices were less frequent than those that occur
during slow-wave sleep (12.4 ± 1.8 min⁻¹ (12 DVR slices, 10 animals) versus
16.45 ± 0.98 min⁻¹ during slow-wave sleep (5 slow-wave sleep epochs,
2 sleeping animals)), although not significantly so (P = 0.18, Student’s
t-test). SWR production in slices was not rhythmically interrupted by
REM-sleep-like activity as it is during sleep. We patched 12 DVR
neurons (Extended Data Fig. 2g-j) and found that, consistent with sleep
data, they typically fired 0-3 action potentials during SWRs and were
silent between SWRs. Under voltage clamp (n = 2), neurons displayed
coincident excitatory and inhibitory input during sharp waves (with
excitation dominating in current-clamp mode).
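
The SWR rate comparison reported above (slices versus sleep) can be approximately reproduced from the summary statistics alone. The following sketch is an illustration and not the authors' analysis: it assumes that the ± values are s.e.m. and that the test is an equal-variance (pooled) two-sample Student's t-test; the function name `pooled_t` is hypothetical.

```python
import math


def pooled_t(mean1, sem1, n1, mean2, sem2, n2):
    """Student's t (equal variances) from means, s.e.m. values and sample sizes."""
    var1 = sem1 ** 2 * n1  # recover sample variance from the s.e.m.
    var2 = sem2 ** 2 * n2
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se_diff, df


# Sleep: 16.45 +/- 0.98 per min (5 epochs); slices: 12.4 +/- 1.8 per min (12 slices).
t_stat, df = pooled_t(16.45, 0.98, 5, 12.4, 1.8, 12)
print(f"t = {t_stat:.2f}, df = {df}")  # t = 1.39, df = 15
```

For 15 degrees of freedom, a t value of about 1.39 lies between the two-sided critical values for P = 0.2 and P = 0.1, in line with the reported P = 0.18.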

We also used multi-electrode arrays on DVR slices (n = 3 brains;
Methods). As observed in vivo, SWRs propagated from anterior medial to
posterior lateral poles (Fig. 2a-c). The apparent linear velocity of the wave in
the slice plane was 39 mm s⁻¹, although propagation contained local
angular components. We further divided DVR slices into ‘mini-slices’ (n = 13,
Fig. 2d). Only those from the anterior medial pole produced SWRs, and
rates did not differ significantly from controls (11.9 ± 1.7 min⁻¹; Fig. 2e, f).


scRNA-seq indicates a claustrum homologue 


Using a single-cell RNA sequencing (scRNA-seq) strategy, we previously 
mapped the main neuronal types of the reptilian pallium and described
heterogeneity among glutamatergic cell types in the Pogona DVR. To 
characterize the amDVR, we sampled Pogona single cells more deeply 
and more extensively (Methods). Using unsupervised graph-based 
Louvain clustering on transcripts from 20,257 cells, we identified 4,054 
pallial glutamatergic neurons that formed 29 glutamatergic clusters 
(Fig. 3a, Extended Data Fig. 3). 

We located these clusters in the Pogona telencephalon using the 
expression of cluster-specific markers, which were detected by in situ 

Fig. 2 | SWRs occur spontaneously in DVR slices and originate at the anterior
medial pole. a-c, CMOS-MEA (complementary metal oxide
semiconductor-microelectrode array) recordings of SWRs (see also Extended Data Fig. 2)
propagating across a horizontal slice of the DVR (outlined). a, Instantaneous
voltage samples at an interval of 20-60 ms. Squares 1-5 indicate the recording
sites shown in b. Note the initiation at the anterior pole. L, lateral; M, medial.

b, SWRs from sites 1-5 in a. Note the differences in amplitude and onset time
between sites. c, Signal latency (relative to the earliest channel) over slice plane
(mean of 12 SWRs; same slice as in a). Scale bar, 1 mm. d-f, SWRs in mini-slices.
A 252-site microelectrode array, 200-μm pitch. d, Thick horizontal slices of
DVR were subdivided. e, Simultaneous LFPs recorded from coloured sites in d.
plDVR, posterior lateral DVR. f, Mean frequency of SWRs in intact slices
(control (ctrl); n = 12 slices); amDVR (n = 13 mini-slices); and plDVR (n = 9
mini-slices). ***P < 0.001. P values: control versus amDVR, P=1, ¢t,;= 0.04; control
versus plDVR, P=7.2 10°; t,,= 6.3; amDVR versus plDVR, P=4.6 x 10°, t,,=6.3
(two-sided Bonferroni test). Data are mean ± s.e.m.


hybridization and/or immunohistochemistry. Two clusters (19 and
20, Fig. 3a) mapped to the amDVR, as shown by expression of hpca 
(which encodes the calcium-binding protein hippocalcin) and adarb2 
(which encodes an RNA-editing enzyme), among others (Fig. 3b-d). 
Clusters 19 and 20 corresponded to medial and lateral subdivisions 
of the amDVR, as shown by expression of the copine-4 (cpne4) and 
nuclear hormone receptor (rorb) genes (Fig. 3e, f). When we repeated 
the mini-slice SWR recordings and labelled those slices afterwards 
with a hippocalcin antibody, we found that only hippocalcin-positive 
mini-slices from the anterior medial pole of the DVR generated SWRs 
(Extended Data Fig. 4). 

Some amDVR markers (for example, gng2, synpr and rgs12; Fig. 3b)
are known markers of the mammalian claustrum. To explore these
molecular similarities further, we used Seurat v.3 to project Pogona
single-cell transcriptomes onto mouse single-cell transcriptomes
on the basis of a joint dimensionality reduction analysis (Methods).
About 63% and 75% of amDVR cells (clusters 19 and 20, respectively)
projected onto the mouse claustrum transcriptomic cluster (Fig. 3g).
This suggests that, consistent with developmental observations,
the Pogona amDVR and the mammalian claustrum are homologous.

To link our transcriptomic and physiological observations, we
analysed the expression of genes that encode ion channels and
neurotransmitter receptors in pallial glutamatergic clusters (143 genes
detected in at least 20% of cells of at least one cluster; Methods). These
genes were sufficient to distinguish the amDVR from other glutamatergic
clusters (Extended Data Figs. 3, 5), and contained clusters of correlated
genes (modules). One module with enriched expression in the amDVR
(Fig. 3h) included receptors for noradrenaline, acetylcholine, dopamine
and serotonin. In mammals, these neuromodulators influence sleep
rhythms and are released by brain nuclei from the hypothalamus to the
medulla. Glutamatergic neurons in the amDVR were among the
few types that co-expressed receptors for all four modulators (Extended
Data Fig. 5). Hence, the amDVR expresses receptor types that are
consistent with a sensitivity to input from circuits that control brain state.
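
The gene-selection criterion used above (detection in at least 20% of cells
of at least one cluster) can be illustrated with a short sketch. This is not
the analysis code used in this study. The function name, the dense
cells × genes count matrix and the definition of "detected" as a count
greater than zero are assumptions made for illustration.

```python
import numpy as np


def genes_detected_in_any_cluster(counts, cluster_labels, min_fraction=0.20):
    """Select genes detected in >= min_fraction of cells of at least one cluster.

    counts         : (n_cells, n_genes) array of transcript counts (assumed layout)
    cluster_labels : length n_cells sequence of cluster identifiers
    Returns a boolean mask over genes.
    """
    counts = np.asarray(counts)
    labels = np.asarray(cluster_labels)
    keep = np.zeros(counts.shape[1], dtype=bool)
    for cluster in np.unique(labels):
        # A gene counts as "detected" in a cell if its count is above zero (assumption).
        detected = counts[labels == cluster] > 0
        # Fraction of cells in this cluster in which each gene is detected.
        keep |= detected.mean(axis=0) >= min_fraction
    return keep
```

A gene that is rare overall can still pass this filter if it is concentrated
in a single cluster, which is the point of applying the threshold per cluster
rather than across all cells.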


The amDVR is extensively connected 


We next mapped the connectivity of the amDVR with areas that have
been associated with wake-sleep control in mammals (as suggested
by the above data) and asked whether the amDVR connects widely
with the rest of the pallium, as the claustrum does in mammals.
We identified, where possible, the Pogona homologues of mammalian
nuclei that are implicated in sleep. Relying on anatomical studies in
related species (Methods), we used immunohistochemistry and
fluorescence in situ hybridization (FISH) to identify and map these nuclei
in the Pogona diencephalon, midbrain and brainstem (Fig. 3i, Extended
Data Fig. 6), together with telencephalic areas mapped by scRNA-seq.

We mapped amDVR connectivity by local tracer injections, using an
adeno-associated virus vector (rAAV2-retro) carrying a
fluorescent-protein gene under the CAG or hSyn promoter for (mostly) retrograde
labelling (Methods). rAAV2-retro was sometimes co-injected with the
(mostly) anterograde tracer AAV2/9-CB7-mCherry-WPRE for
injection-site identification. Because they do not cross synapses, these tracers
revealed the direct targets (AAV2/9-CB7) and sources (rAAV2-retro) of
the injection site. The results are summarized in Fig. 3j. The names on 
the left all describe telencephalic structures, for which the input and 
output connectivity with the amDVR (‘claustrum’) could be tested. On 
the right are deeper structures that for anatomical reasons could not 
be reached for injection. For these structures, connectivity to the claus- 
trum was established only by retrograde labelling from the amDVR, 
and the question of whether the claustrum projects to those areas will 
require further investigation and direct demonstration. 

The cortical sources of input to the amDVR were the anterior and 
posterior dorsal cortices (Fig. 3j, Extended Data Fig. 6c). Retrograde 
and anterograde tracers revealed no direct projections from the dorso- 
medial cortex (homologue of the hippocampal cornu ammonis (CA)) 
and medial cortex (homologue of the dentate gyrus (DG)) to the amDVR,
even though the amDVR projects to both (Fig. 3j). In the subcortical 
pallium, the anterior DVR (aDVR) and posterior DVR (pDVR) showed 
strong projections to the amDVR. The amDVR also received input from
the dorsal thalamus (dorso-medial, dorso-lateral and dorso-lateral 
posterior nuclei), prethalamus, hypothalamus, ventral tegmental area, 
substantia nigra, the periaqueductal grey in the midbrain, and the 
locus coeruleus, subcoeruleus and raphe nucleus in the brainstem 
(Extended Data Fig. 6). 

The amDVR projected to the hippocampus (medial cortex and
dorso-medial cortex), posterior dorsal cortex (potential subiculum
homologue) and anterior dorsal cortex (neocortex homologue). In
the subcortical pallium, projections to the aDVR were dense and
extensive, consistent with sharp-wave propagation (Figs. 1, 2). Projections
between the amDVR and some of its targets appeared ordered: more
lateral amDVR projected to rostral aDVR and central amDVR projected to
caudal aDVR. Conversely, input to the amDVR from the cortex (anterior
and posterior dorsal cortex) was strongest laterally and weakest
medially (absent from dorso-medial and medial cortices, or hippocampus).

Hence, the amDVR is connected with the pallial forebrain and receives 
input from areas that are implicated in wake-sleep control—consistent 
with the widespread expression of many receptor genes that are specific 
to these areas. On the basis of these transcriptomic and anatomical 
data, we conclude that the amDVR is the reptilian homologue of the 
mammalian claustrum. 


The claustrum homologue in turtles 


Having applied similar transcriptomic approaches to those used in
Pogona to the turtle T. scripta, a species on a distant branch of the
reptilian tree, we looked for a turtle claustrum. Comparison of
transcriptomic data (Methods) yielded four potential clusters in Trachemys
(Extended Data Fig. 7). Cells in these clusters were located in a region
known as the pallial thickening. Turtle pallial thickening and lizard
amDVR are both in the anterior pallium, consistent with their similar
developmental origin in anterior lateral pallium; however, turtle pallial
thickening is lateral to the aDVR and close to the olfactory cortex, rather
than being fused to the rest of the DVR as the claustrum is in Pogona.
Architectonics also differed: the Pogona claustrum is nuclear and
composed of isotropically distributed multipolar neurons, whereas turtle
pallial thickening forms a curved sheet that extends the anterior dorsal
cortex and is traversed from below by lateral geniculate nucleus (LGN)
axons en route to the visual cortex. Principal neurons in turtle pallial
thickening (revealed after rAAV2-retro injection into the dorso-medial
cortex) are pyramid-like, with apical and basal dendrites (Extended
Data Fig. 7d). Despite these differences, slices of turtle pallial
thickening produced SWRs that led those in the DVR, as in Pogona. This
pallial thickening therefore appears to be the turtle claustrum, suggesting
that a homologue of the claustrum already existed in the common
ancestor of amniotes.


Manipulating claustrum activity 


We developed a reduced ex vivo preparation of the Pogona forebrain, 
which enabled direct access to the non-cortical pallium after removal 
of the cortex (Methods). This preparation generated spontaneous 
SWRs in the claustrum and DVR that were similar to those recorded 
in vivo during sleep and to those that occur in slices containing both 
claustrum and DVR (claustrum + DVR) (Extended Data Fig. 8). SWRs 
occurred continuously but more frequently in the forebrain preparation
(21.6 ± 5.4 min⁻¹, 4 brains) than in slices (12.4 ± 1.8 min⁻¹, n = 13). SWRs
in the claustrum led those in the DVR (Extended Data Fig. 8f), with delays
similar to those observed during sleep or in slices of claustrum + DVR 
(11-141 ms, peak mean correlation 0.57, 4 brains). To test the causal role 
of the claustrum in generating SWRs, we injected tetrodotoxin (TTX) 
selectively into the claustrum ex vivo (n=4, 3 animals). This resulted in 
a prolonged silencing of the claustrum, and the concomitant cessation
of SWRs in the ipsilateral DVR (Extended Data Fig. 8b-d). 

We next generated lesions in one or both claustra in vivo using
ibotenic acid (Methods; three animals). Bilateral recordings from the DVR
in sleeping lesioned animals revealed that the rhythmic modulation of
β-band activity (REM sleep) was unaffected, but that SWRs
(characteristic of slow-wave sleep) were eliminated on the side(s) of the lesioned
claustrum (Fig. 4a-d, Extended Data Fig. 9). These findings show that
the claustrum is required for the production of SWRs in the DVR during 
slow-wave sleep; that its action is unilateral; and that it is not involved 
in the alternating sleep rhythm itself. 

Because the claustrum receives direct input from areas that are
implicated in wake-sleep control in mammals and expresses receptors
for their transmitters (Fig. 3j, Extended Data Fig. 5), we tested how
sensitive SWR production was to those transmitters. Dopamine
significantly increased the rate of SWR production, whereas
acetylcholine and serotonin decreased it (Fig. 4e). We selected serotonin
for further experiments. Consistent with tracing data that indicated
a serotonergic input from the raphe nuclei, the claustrum contained
serotonin-positive fibres (Extended Data Fig. 10a). Serotonin at
concentrations of 1 µM or higher suppressed SWRs (n = 9 claustrum + DVR
slices, 9 animals; Extended Data Fig. 10b). This effect was best mimicked
by the serotonin receptor-1D agonist L-703,664 (Fig. 4f), consistent
with scRNA-seq results (Extended Data Fig. 5). We then superfused
slices with caged serotonin (Methods). SWRs were suppressed within 
seconds of the onset of illumination, and resumed when illumination 
stopped (Fig. 4g, h). 

The mammalian claustrum is hypothesized to have a role in higher
cognition because of its hub-like connectivity. Direct
experimental tests are difficult, however, owing to the anatomy of the
claustrum. Using scRNA-seq and tract-tracing techniques, we
identified a claustrum in two reptiles from distant evolutionary lineages.
This result, added to mammalian evidence, suggests that a claustrum
probably existed in the common ancestor of amniotes. The claustrum
probably derives from the lateral pallium and may correspond to parts
of the mesopallium in birds. Thus, if the claustrum has a role in
higher cognition in mammals, this role may be derived from other
functions in a common amniote ancestor. The claustrum assumes different
architectonics, which are reflected in neuronal morphology, in two
distant reptiles. (Of note, differences also exist between marsupial and
placental mammals.) Because the claustrum produces SWRs in both
reptiles, architectonics probably have little role in SWR generation.
The claustrum participates in the generation and relaying of SWRs
that are characteristic of slow-wave sleep in Pogona. Given the
widespread connectivity of the claustrum and its input from
wake-sleep-controlling areas, it may be implicated in coordinating forebrain states
during sleep. Early experiments in cats described sleep-like behaviour
after (though not during) low-frequency stimulation of the claustrum.
These results remain uncertain because selective stimulation of the
mammalian claustrum is difficult. More recent results in rodents, using
markers of synaptic activity, suggest that the claustrum is active
during REM sleep. Other studies suggest that the claustrum acts
to shut down the cortex, through dominant projections onto cortical
interneurons. This action would cause a general cortical down-state,
as is possibly seen during certain phases of slow-wave sleep. These
results collectively suggest a tentative link between the claustrum and
sleep in mammals.

Fig. 4 | SWR production in the DVR depends on claustrum integrity and
modulation. a-d, Ibotenic-acid-induced lesions of the claustrum and SWRs in
sleeping lizards (see also Extended Data Fig. 9). a, Short sleep segment showing
LFP (<150 Hz) from left and right DVRs after unilateral claustrum lesion.
Sham-lesioned hemisphere (CLA+) is shown in blue; lesioned hemisphere (CLA−) is
shown in red. The arrowheads indicate sharp waves in the DVR. Sleep rhythm is
intact but sharp waves are nearly absent in slow-wave sleep on the lesioned
side. b, Same as a, in a lizard with bilateral lesions of the claustrum. Note the
absence of sharp waves. c, Cross-correlation of β-band (REM sleep) power
across hemispheres in lesioned animals. d, Number of sharp waves per
slow-wave sleep cycle in sham and claustrum-lesioned hemispheres. ***, significantly
different from sham; P<1.73 x 10°, W= 64,252 (Wilcoxon signed-rank test;
data from 2 animals, 4 nights, 375 cycles). For details of box plots, see Methods.
e-h, Experiments in slices of DVR + claustrum or isolated claustrum. e, Effects
of superfused noradrenaline (NA; 25 µM; n = 7), dopamine (DA) agonist
SKF38393 (10 µM; n = 7), acetylcholine (ACh) agonist carbachol (50 µM; n = 5)
and serotonin (5HT; 10 µM; n = 4) on the frequency of spontaneous SWRs.
f, Action of serotonin receptor (5HTR) agonists on the rate of spontaneous
SWRs in isolated slices of the claustrum. n = 3 experiments (5HTR-1A); n = 4
(5HTR-1B); n = 5 (5HTR-1D); n = 5 (5HTR-2C); n = 4 (5HTR-7). ***, * and *,
significantly different from baseline; ***P=8.0 x 10°, T=15 (two-sided
Wilcoxon rank-sum test); *P= 0.04, t,=-2.9; *P= 0.049, t,=—2.3 (paired t-test).
Data are mean ± s.e.m. (e, f). g, Light-triggered uncaging of serotonin
suppresses spontaneous SWRs in CLA + DVR slices. Blue shading, light on (hν).
h, Summary of eight experiments performed as in g, with bins of 10 s. Data are
mean ± s.e.m. For the control experiments, light pulses were applied to
ACSF-superfused slices. ***, experimental bins significantly different from control;
P=1.5x10"+, T=36 (two-sided Mann-Whitney rank-sum test).

During sleep in Pogona, SWRs originate in the claustrum and
propagate to the rest of the non-cortical pallium, the mammalian homologue
of the amygdaloid complex. By virtue of ascending input from areas
that control the wake-sleep cycle, the claustrum is ideally positioned
to act as a relay for wake-sleep-related states in the forebrain. During
sleep the claustrum alternates between SWR production and REM,
presumably driven by alternating ascending inputs that are themselves
independent of claustrum integrity. Claustrum projections suggest a
distributed action on the cortex, hippocampus, amygdala and other
areas of the forebrain. SWRs in sleeping Pogona in vivo are each
correlated with a short phasic inhibition of the cortex (consistent with
stimulation experiments (Extended Data Fig. 8) and with results in
rodents) followed by cortical excitation (consistent with
coordination between area CA1 and the medial prefrontal cortex in rodents).
The mechanisms that underlie this coordination, and the nature of
sleep-related inputs to the claustrum, now require characterization.


Methods


Data reporting 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized and the investigators were not 
blinded to allocation during experiments and outcome assessment. 


Animals 
Lizards (Pogona vitticeps) of either sex, weighing 100-400 g, were 
obtained from our institute colony, selected for sex, size, weight, health 
status and wild-type colouring. Wild-type turtles (T. scripta elegans or
Chrysemys picta) of either sex, weighing 200-400 g, were obtained 
from an open-air breeding colony (NASCO Biology). The animals were 
housed in our state-of-the-art animal facility. 

All experimental procedures were performed in accordance with 
German animal welfare guidelines: permit no. V54-19c 20/15- F126/1005 
delivered by the Regierungspraesidium Darmstadt (E. Simon). 


Lizard surgery for chronic recordings 
Twenty-four hours before surgery, the lizard was administered analgesics
(butorphanol, 0.5 mg kg⁻¹ subcutaneously; meloxicam, 0.2 mg kg⁻¹
subcutaneously) and antibiotics (marbofloxacin, marbocyl, 2 mg kg⁻¹).
On the day of surgery, anaesthesia was initiated with isoflurane, and 
maintained with isoflurane (1-4 vol%) after intubation. The lizard was 
placed in a stereotactic apparatus after ensuring deep anaesthesia 
(absence of corneal reflex). Body temperature during surgery was 
maintained at 32 °C using a heating pad and oesophageal temperature 
probe. Heart rate was monitored using a Doppler flow detector. The skin 
covering the skull was disinfected using 10% povidone-iodine solution 
before removal with a scalpel. A small (around 3 x 2-mm) craniotomy 
was then drilled posterior lateral to the parietal eye along the midline. 
The dura and arachnoid layers covering the forebrain were removed 
with fine forceps, and the pia was removed gently over the area of elec- 
trode insertion (dorsal or dorso-medial cortex). The exposed skull was 
covered with a layer of ultraviolet (UV)-hardening glue, and the bare 
ends of two insulated stainless steel wires were secured in place sub- 
durally with UV-hardening glue to serve as the reference and ground. 
For insertion of silicon probes, probes were mounted on a Nanodrive
(Cambridge Neurotech) and secured to a stereotactic adaptor. On 
the day after the surgery, probes were slowly lowered into the tissue 
(about 0.9-1.2 mm). The brain was covered with Duragel followed by 
Vaseline. After connecting grounds, the skull, craniotomy and probes 
were secured with dental cement. After surgery, lizards were released 
from the stereotactic apparatus and left on a heating pad set to 32 °C 
until full recovery from anaesthesia. 


In vivo electrophysiology 

One week before surgery, lizards were habituated to a sleep arena for
a minimum of two nights. One to two hours before lights off, the lizard
was placed in the sleep arena, which was itself placed in a 3 × 3 × 3-m
EM-shielded room. The animal was left to sleep and behave naturally 
overnight, and returned to its home terrarium 3-4 h after lights on. 
The animal then received food and water. Recordings were made from 
the cortex, anterior DVR (including claustrum) and/or posterior DVR 
of chronically implanted adult lizards. Electrodes were 32-channel 
silicon probes (50-µm pitch, 177-µm² surface area for each site; in 2
rows of 16 contacts). 

Recordings were performed with a Cheetah Digital Lynx SX system and 
HS-36 headstages of unity gain and high input impedance (~1 TΩ). The
headstage was connected with a headstage adaptor to a connector on the
head, and a lightweight shielded tether cable connected the headstage
to the acquisition system. Recordings were grounded and referenced 
against one of the reference wires. Signals were sampled at 32 kHz, with 
wide-band 0.1-9,000 Hz. Electrophysiological traces were typically 
low-pass filtered at 150 Hz with a 2-pole Butterworth filter for display. 


Ibotenic acid lesion experiments 

In preparation for claustrum lesion experiments we carefully removed, 
using fine forceps in anaesthetized animals, the pia overlaying the 
dorsal cortex and inserted a beveled quartz micropipette at an angle 
of 90° to the surface, to a depth of 1,050-1,150 µm from the surface,
at appropriate anterior-posterior and medial-lateral coordinates to
reach the centre of the claustrum. Ibotenic acid (400-600 nl; 5 µg µl⁻¹
in PBS, pH 7.2) was injected at a rate of 50-100 nl min⁻¹ (UMP3, World
Precision Instruments). The injection pipette was retracted 3 min after 
the end of injection. Two silicon recording probes were subsequently 
positioned bilaterally, as described above, for DVR recordings. For 
sham claustrum lesions, we injected PBS alone (same methods and 
volumes) on the sham-lesion side. Recordings were carried out each 
night from one to six days after surgery. Effects of the lesions could 
already be observed 24 h after surgery. A week after each experiment,
the animal was killed and its brain was sectioned and stained with Nissl 
for histological confirmation. 


SWR delay calculation 

Sharp waves were detected as described previously (template-based
detection). After independently detecting SWRs on probes in the aDVR
and pDVR throughout the dataset, the delay between SWRs across
probes was calculated by pairing SWRs on one probe with the SWR
closest in time on the second probe. Pairs occurring more than 500 ms
apart were ignored. 
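
The pairing rule described above can be sketched as follows. This is a
minimal illustration, not the original analysis code. The function name, the
use of seconds as the time unit and the sign convention (delay = time on the
second probe minus time on the first) are assumptions.

```python
from bisect import bisect_left


def pair_swr_delays(times_a, times_b, max_gap_s=0.5):
    """Pair each SWR on probe A with the closest-in-time SWR on probe B.

    times_a, times_b : SWR times in seconds (any order)
    max_gap_s        : pairs further apart than this are ignored (500 ms in the text)
    Returns the list of delays (t_b - t_a) for the retained pairs.
    """
    sorted_b = sorted(times_b)
    delays = []
    for t in times_a:
        i = bisect_left(sorted_b, t)
        # The nearest neighbour is either just before or just after t.
        candidates = sorted_b[max(i - 1, 0):i + 1]
        if not candidates:
            continue  # no events on probe B at all
        nearest = min(candidates, key=lambda tb: abs(tb - t))
        if abs(nearest - t) <= max_gap_s:
            delays.append(nearest - t)
    return delays
```

With this sign convention, positive delays mean that the event on the first
probe led the event on the second probe.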


SWRs at the slow-wave sleep-REM sleep transition point

REM and slow-wave sleep periods and the timing of their transition were 
calculated as described previously. Average SWR rates and amplitudes
were calculated by averaging the values triggered on all slow-wave 
sleep-REM sleep transition points within 100-ms bins, and smoothing 
the resulting histogram with a Gaussian filter (s.d., 25 ms). 

In ibotenic acid lesion experiments, sleep cycles were determined
using median filtered β-band power (10-40 Hz, as above), for a 6-h
period beginning 3 h after the recording start time. The time course
of β was filtered above 0.001 Hz with a 2-pole Butterworth filter, and
additionally smoothed with a Gaussian filter (s.d., 20 s). Periods of
slow-wave sleep were conservatively defined as ones in which this signal was
less than 1 s.d. below the mean. To avoid false sharp-wave detections in
lesioned animals (which demonstrate reduced low-frequency power),
sharp waves were detected by thresholding the voltage trace (1.5-2.5
s.d. below the mean) after low-pass filtering at 4 Hz with a 2-pole
Butterworth filter. The threshold was adapted to each lesion experiment
and was the same for both hemispheres within each experiment.
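
Both criteria above are thresholds expressed in standard deviations below the
mean of an already filtered signal. The sketch below illustrates that step
only. It is not the original analysis code: the function names are
hypothetical, the filtering and smoothing steps are assumed to have been
applied beforehand, and the slow-wave sleep criterion is read here as "signal
value lower than the mean minus 1 s.d.", which is an interpretation of the
text.

```python
import numpy as np


def below_mean_mask(signal, n_sd):
    """True where the signal lies more than n_sd standard deviations below its mean."""
    signal = np.asarray(signal, dtype=float)
    return signal < signal.mean() - n_sd * signal.std()


def count_threshold_crossings(mask):
    """Count contiguous runs of True, that is, separate downward threshold crossings."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return 0
    # A run starts wherever the mask switches from False to True.
    onsets = np.flatnonzero(mask[1:] & ~mask[:-1])
    return int(mask[0]) + int(onsets.size)
```

Under these assumptions, `below_mean_mask(beta_power, 1.0)` would mark
candidate slow-wave sleep samples, and
`count_threshold_crossings(below_mean_mask(voltage, 2.0))` would count
candidate sharp waves in a low-pass-filtered voltage trace, with the same
threshold applied to both hemispheres.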


Sharp-wave shape statistics 
For comparison with ex vivo and slice sharp waves, sharp waves detected 
in vivo were low-pass filtered at 20 Hz using a 2-pole Butterworth filter. 


Ex vivo brain and slice preparations 

Adult lizards or turtles were deeply anaesthetized with isoflurane,
ketamine (60 mg kg⁻¹) and midazolam (2 mg kg⁻¹). After loss of the
corneal reflex, the animals were decapitated and the heads were
rapidly transferred into cooled ACSF solution (lizard: 126 mM NaCl, 3 mM
KCl, 1.8 mM CaCl₂, 4 mM MgCl₂, 24 mM NaHCO₃, 0.72 mM NaH₂PO₄, 20
mM glucose, pH 7.4; turtle: 96.5 mM NaCl, 2.6 mM KCl, 4 mM CaCl₂,
2 mM MgCl₂, 31.5 mM NaHCO₃, 20 mM glucose, pH 7.4) bubbled with
carbogen gas (95% O₂, 5% CO₂).

Ex vivo intact subcortical slabs were prepared with iridectomy
scissors, after isolation of the lizard brain. For slice preparation, coronal,
horizontal or sagittal subcortical area slices (700 µm thick) were cut
using a vibratome (VT 1200S, Leica) in ice-cold oxygenated ACSF. The
slices were allowed to recover for at least 60 min and then submerged
in a chamber filled with oxygenated ACSF (lizards: 126 mM NaCl, 3 mM
KCl, 1.8 mM CaCl₂, 1 mM MgCl₂, 24 mM NaHCO₃, 0.72 mM NaH₂PO₄, 20
mM glucose, pH 7.4; turtle: 96.5 mM NaCl, 2.6 mM KCl, 4 mM CaCl₂,
2 mM MgCl₂, 31.5 mM NaHCO₃, 20 mM glucose, pH 7.4) at 20-22 °C.


Ex vivo brain and slice physiology and SWR detection 

During recordings, oxygenated ACSF (lizard: 126 mM NaCl, 3 mM KCl,
1.8 mM CaCl₂, 1.2 mM MgCl₂, 24 mM NaHCO₃, 0.72 mM NaH₂PO₄, 20 mM
glucose, pH 7.4; turtle: 96.5 mM NaCl, 2.6 mM KCl, 4 mM CaCl₂, 2 mM
MgCl₂, 31.5 mM NaHCO₃, 20 mM glucose, pH 7.4) was constantly
superfused at 18-20 °C (ex vivo) and 18-21 °C (slices) at 4 ml min⁻¹. LFPs were
recorded using microelectrode arrays, silicon probes or glass pipettes
filled with ACSF. The electrodes were carefully placed in the targeted
areas with micromanipulators. Signals were low-pass filtered at 2 kHz
and digitized at 20 kHz. For analysis of sharp waves, the traces were
further low-pass filtered at 20 Hz using a 2-pole Butterworth filter. SWRs
were detected at a threshold of 3 × s.d. of the total signal. The detected
events were visually scrutinized and manually rejected if they were
erroneously detected. Events lasting less than 30 ms were also discarded
as they were typically artefacts. For claustrum electrical-stimulation
experiments, stimulation pulses lasted 50 µs and were delivered with
bipolar electrodes. Multi-unit extracellular recordings in cortex were
carried out with glass micropipettes filled with ACSF. Mini-slices were
cut with a sharp razor blade and were 0.61-3.12 mm² in surface area.
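
The automatic part of the detection described above (threshold at 3 × s.d.,
rejection of events shorter than 30 ms) can be sketched as follows. This is an
illustration under stated assumptions, not the original code: the function
name is hypothetical, the trace is assumed to be already low-pass filtered at
20 Hz, and events are assumed to be negative-going deflections measured
relative to the mean. The manual inspection step is not represented.

```python
import numpy as np


def detect_events(trace, fs_hz, n_sd=3.0, min_duration_s=0.030):
    """Return (start, stop) sample indices of sufficiently long threshold crossings.

    trace  : 1-D filtered LFP trace (filtering is assumed to be done upstream)
    fs_hz  : sampling rate in Hz
    stop is an exclusive index, so the event spans trace[start:stop].
    """
    trace = np.asarray(trace, dtype=float)
    # Assumption: events are deflections more than n_sd s.d. below the mean.
    below = trace < trace.mean() - n_sd * trace.std()
    # Pad with False so that every run has a detectable start and end.
    padded = np.concatenate(([False], below, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    min_samples = int(round(min_duration_s * fs_hz))
    # Discard events shorter than the minimum duration (typically artefacts).
    return [(int(a), int(b)) for a, b in zip(starts, stops) if b - a >= min_samples]
```

At the 20-kHz digitization rate given above, the 30-ms minimum corresponds to
600 samples.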


CMOS-MEA experiments 
The slices were placed over a high-density microelectrode array (3Brain
AG) of 4,096 electrodes (electrode size, 21 × 21 µm; pitch, 81 µm; 64 × 64
matrix; 5.12 × 5.12-mm area). During recording, ACSF perfusion was
interrupted to avoid movements of the slices and noise as a result of
ACSF flux. Signals were sampled at 18 kHz with a high-pass filter at 1 Hz.
Saturating or damaged channels were detected as channels whose
voltage crossed ± 500 µV and were removed from later analysis. Channel
data were low-pass filtered at 20 Hz and z-scored, and troughs greater
than 5(z) below the mean on the channel with the largest signal were
taken as sharp waves. The signal ± 400 ms from these peak times, on all
channels, was taken as a SWR episode. For calculation of SWR latency,
SWRs were averaged on each channel and the time that the average
signal crossed 1(z) below the mean was taken as the start of the SWR on
that channel. Latency was calculated relative to the time of the SWR of
the earliest channel. Channels that did not cross 1(z) were considered
maximum latency. The resulting latency image was filtered with a 3 × 3
median filter to remove the effect of bad channels, and upsampled by
a factor of 10 for display.
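
The latency calculation described above can be sketched as follows. This is a
minimal illustration, not the original code. The function names, the
(rows, columns, samples) array layout and the edge handling of the median
filter (edge replication) are assumptions. The input is assumed to be the
SWR-averaged, z-scored signal of each channel, and the upsampling step is
omitted.

```python
import numpy as np


def latency_map(mean_swr, fs_hz, z_thresh=-1.0):
    """Onset latency (ms) of the averaged SWR on each channel.

    mean_swr : (rows, cols, samples) array of z-scored, SWR-averaged signals
    Latency is measured relative to the earliest channel. Channels that never
    cross the threshold are assigned the maximum latency.
    """
    mean_swr = np.asarray(mean_swr, dtype=float)
    crossed = mean_swr < z_thresh
    has_onset = crossed.any(axis=-1)
    if not has_onset.any():
        raise ValueError("no channel crosses the threshold")
    # argmax returns the index of the first True along the time axis.
    onset = crossed.argmax(axis=-1).astype(float)
    onset[~has_onset] = np.nan
    latency_ms = (onset - np.nanmin(onset)) / fs_hz * 1000.0
    latency_ms[~has_onset] = np.nanmax(latency_ms)
    return latency_ms


def median_filter_3x3(image):
    """3 x 3 median filter with edge replication (edge handling is an assumption)."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    padded = np.pad(image, 1, mode="edge")
    # Stack the nine shifted views and take the median across them.
    windows = [padded[r:r + rows, c:c + cols] for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0)
```

The median filter removes isolated outliers, such as a single bad channel,
without blurring the latency gradient across the slice as a mean filter would.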


Whole-cell patch-clamp recordings of DVR and claustrum 
neurons 

Long-shank patch pipettes (6-8 MΩ) were pulled from borosilicate
glass with a Sutter P1000 electrode puller. Pipettes were filled with
internal solution (140 mM K-gluconate, 4 mM NaCl, 14 mM
phosphocreatine, 10 mM HEPES, 4 mM Mg-ATP, 0.3 mM Na-GTP, 4 mg ml⁻¹
biocytin). Experiments were carried out on an upright Olympus BX61WI
microscope with 5x and 40x water-immersion objectives and cells were 
patched under visual guidance. Excitatory and inhibitory postsynaptic 
currents were recorded in the voltage-clamp configuration with the 
same cell held at either -70 mV or +10 mV. Simultaneous patch-clamp 
and LFP recordings were carried out with an EPC10 Quadro amplifier 
(HEKA). 


Pharmacology 

Serotonin hydrochloride (0.1-30 µM), carbamoylcholine chloride (50
µM), noradrenaline bitartrate (25 µM), SKF38393 hydrobromide (10
µM), (R)-(+)-8-hydroxy-DPAT hydrobromide (2 µM), L-703,664
succinate (1 µM), CP 809101 hydrochloride (0.1 µM), LP44 (0.2 µM) and
TTX (20 µM) were diluted to their final concentrations in ACSF (126 mM
NaCl, 3 mM KCl, 1.8 mM CaCl₂, 1.2 mM MgCl₂, 24 mM NaHCO₃, 0.72 mM
NaH₂PO₄, 20 mM glucose, pH 7.4). For slice experiments, drugs were
continuously bath-applied after a baseline recording period of 5-20
min. For ex vivo experiments in Extended Data Fig. 8, TTX dissolved
in ACSF was injected into the claustrum through a glass micropipette
using a 10-ml syringe pressurizer (20-30 hPa for 15 min). For serotonin
uncaging, RuBi-5HT (Abcam) (10 µM) was bath-applied, and white light
(400-700 nm, 0.11 W cm⁻², TH4-200, Olympus) was turned on and off
at chosen intervals (for example, 80 s).

We tested several metabotropic 5HTR agonists. Of those, the
5HTR-1D agonist L-703,664 best mimicked the effects of serotonin,
consistent with the high expression of 5HTR-1D in glutamatergic neurons in
the claustrum (Extended Data Fig. 5a). The 5HTR-7 agonist LP44 had
no effect (Fig. 4f), which is also consistent with the low expression
of 5HTR-7 in claustrum excitatory neurons. The 5HTR-2C agonist CP
809101 increased the rate but not the amplitude of SWRs.


scRNA-seq libraries 

Adult male lizards (150-400 g) were deeply anaesthetized with
isoflurane, ketamine (50 mg kg⁻¹) and midazolam (0.5 mg kg⁻¹) and
decapitated. The head was immersed in ice-cold oxygenated ACSF (126 mM
NaCl, 3 mM KCl, 2 mM CaCl₂, 4 mM MgCl₂, 24 mM NaHCO₃, 0.72 mM
NaH₂PO₄, 20 mM glucose, pH 7.4). The brains were perfused to remove
blood from the vasculature. The data shown originate from four librar- 
ies constructed from data from one male lizard (160 g, 20 months old). 

Thereafter, the brain was removed and immersed in oxygenated 
ice-cold ACSF. The brain was embedded in 4% low-melting agarose, 
glued to the base of a vibratome (VT1200S, Leica) and immersed in 
ice-cold oxygenated ACSF, and 500-µm-thick sections were prepared
(speed, 0.08 mm s⁻¹). The sections were individually inspected under
a dissection microscope (Stemi 2000-C, Zeiss) and anatomical regions
of interest were dissected (telencephalon, amDVR). These slices were
cut with fine scissors (Fine Science Tools) into small cubes of tissue
(around 500 × 500 × 500 µm).

These were transferred to dissociation buffer (20 U ml⁻¹ papain, 200
U ml⁻¹ DNase I, 25 µg ml⁻¹ liberase TM, 1 µM TTX, 100 µM D-APV) and
triturated with fire-polished, silanized glass pipettes of decreasing tip
diameter (around 10 passes per pipette). After every pipette change
the supernatant (dissociated cell suspension) was removed and filtered
through a strainer with 100-µm mesh.

The pooled dissociated cell suspension was diluted to 20 ml (with
Hibernate A −CaCl₂), transferred to a 50-ml reaction tube and filtered
with a strainer with 40-µm mesh. Then 5 ml of 4% bovine serum albumin
(BSA) in Hibernate A −CaCl₂ was added to the bottom of the tube with
a long-stemmed glass pipette. The solution was spun in a centrifuge at
300g at 4 °C (lowest acceleration and brake) for 5 min. The supernatant
was removed and the cell pellet resuspended in 20 ml of Hibernate
A −CaCl₂. This procedure was repeated for a second-gradient clean-up.
The pellet was then resuspended in an appropriate amount (50-200 µl)
of Hibernate A −CaCl₂ −MgCl₂ and the cell concentration was measured
with a Fuchs-Rosenthal cell-counting chamber (Brand).

The cell suspension was then diluted to 466 cells µl⁻¹ and used as
input to half a chip (four samples) of the 10X Chromium system
(Chem-
istry v.3) with a targeted cell recovery of 7,000 cells per sample. The 
library construction was performed according to the manufacturer’s 
instructions. 

The final four libraries were quantified using a Qubit fluorometer
(Thermo Fisher Scientific) and sequenced five times on a DNA sequencer
(NextSeq 500, Illumina) with an average depth of 442,806,563 reads 
per library. 


Analysis of transcriptomics data 

Raw sequencing data were processed using Cellranger v.3.0 (10X 
Genomics). Raw reads were demultiplexed and filtered with the cell- 
ranger mkfastq function with default settings. To generate digital gene 
expression matrices, demultiplexed reads were aligned to the Pogona
genome with the cellranger count function, setting the force-cells
parameter to 7,000. For read alignment, we reannotated the Pogona
genome (assembly 1.1.0, NCBI accession number GCF_900067755.1,
10 April 2017) using the same 3’-end MACE (massive analysis of cDNA
ends) data and the approach described previously.

Digital gene expression matrices were analysed in R, using the 
Seurat v.3.0 package. Cells were filtered by number of genes (more 
than 800 genes per cell) and percentage of mitochondrial genes 
(lower than 5%), yielding a total of 20,257 cells, with a median 
number of 2,278 transcripts and 1,349 genes per cell. Data were 
normalized by the total number of transcripts detected in each cell, and 
regressed by the number of genes and of transcripts (by setting 
vars.to.regress = c("nFeature_RNA", "nCount_RNA") in the ScaleData function). 
Variable genes were identified after variance standardization from 
an estimate of the mean-variance relationship (FindVariableFeatures, 
selection.method = "vst"), and the top 1,000 highly variable genes were used 
for principal component analysis. The first 30 principal components 
were used for Louvain clustering (FindClusters, resolution = 0.2) and 
for dimensionality reduction with UMAP (RunUMAP with default 
settings). 
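
The cell-filtering and normalization steps above can be illustrated with a minimal Python sketch. It is not the Seurat code used for the analysis: the function name, the "mt-" prefix used to recognize mitochondrial genes and the scale factor of 10,000 (the Seurat default for log-normalization) are assumptions made for illustration only.

```python
import math


def qc_and_normalize(counts, gene_names, min_genes=800, max_mito_pct=5.0,
                     mito_prefix="mt-", scale_factor=10_000):
    """Filter cells and log-normalize their counts.

    counts: list of per-cell lists of raw transcript counts (cells x genes).
    Returns the indices of retained cells and their normalized profiles.
    The mitochondrial prefix and scale factor are illustrative assumptions.
    """
    mito_idx = [i for i, name in enumerate(gene_names)
                if name.lower().startswith(mito_prefix)]
    kept, normalized = [], []
    for cell_id, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            continue  # empty barcode: nothing to normalize
        n_genes = sum(1 for c in row if c > 0)
        mito_pct = 100.0 * sum(row[i] for i in mito_idx) / total
        # Keep cells with more than min_genes detected genes and
        # a mitochondrial fraction lower than max_mito_pct.
        if n_genes > min_genes and mito_pct < max_mito_pct:
            kept.append(cell_id)
            # Divide by the cell's total count, rescale, then log(1 + x).
            normalized.append(
                [math.log1p(c / total * scale_factor) for c in row])
    return kept, normalized
```

With the thresholds stated above, the call would be qc_and_normalize(counts, gene_names) with the default arguments; regression of nFeature_RNA and nCount_RNA, variable-gene selection and clustering are not reproduced here.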

After this first round of analysis, neuronal clusters (characterized by 
high expression of pan-neuronal markers, such as the synaptic protein 
snap25) were analysed again using the above procedure with the 
following settings: more than 800 genes per cell, 2,000 highly variable 
genes, 30 principal components, clustering resolution = 0.2. This led 
to the identification of 12 neuronal clusters. One cluster of doublets, 
recognized by the co-expression of markers of glutamatergic and 
GABAergic (γ-aminobutyric acid-producing) neurons, was removed, 
leaving 9,777 neurons. These were analysed again with the same 
parameters (but clustering resolution = 2) to yield 33 clusters (Extended 
Data Fig. 3). 

From this neuronal dataset, we identified 4,054 pallial glutamatergic 
neurons (with more than 1,000 genes per cell) that co-expressed 
the vesicular glutamate transporters slc17a7 and slc17a6. Further 
subclustering of these cells (analysis settings: 2,000 highly variable 
genes, 34 principal components, clustering resolution = 3) led to the 
identification of 29 clusters (Fig. 3a, Extended Data Fig. 3). To assign 
an identity to each of these clusters, we analysed the expression of 
marker genes with known tissue expression patterns. This allowed us 
to define the pallial region to which each cluster belongs (for example, 
hippocampus for zbtb20-expressing clusters). Further annotation of 
cluster identities (Extended Data Fig. 3) was based on the expression 
of selective markers or combinations of marker genes, identified from 
the transcriptomics data. A note on gene nomenclature: all reptilian 
gene names are in lower case in the Article as per Nature style; however, 
in the extended data figures, reptilian gene names are in uppercase, 
according to convention. 


Analysis of ion-channel and neurotransmitter-receptor genes 
We mined the Pogona genome for the following gene families: 
noradrenaline, acetylcholine, serotonin and dopamine receptors; calcium, 
chloride, sodium and potassium channels; and GABA, glutamate, adenosine, 
cannabinoid, glycine and histamine receptors. This yielded 270 genes 
in total. Of these, 143 were kept for further analysis, because they were 
detected in at least 20% of the cells of at least one glutamatergic cluster 
(Extended Data Fig. 5a). 
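
The detection criterion (a gene is kept if it is detected in at least 20% of the cells of at least one cluster) can be written as the following Python sketch. The function and argument names are hypothetical, and 'detected' is assumed here to mean a raw count greater than zero.

```python
from collections import defaultdict


def genes_detected_in_any_cluster(counts, cluster_labels, gene_names,
                                  min_fraction=0.20):
    """Return genes detected in >= min_fraction of cells of >= 1 cluster.

    counts: list of per-cell count lists (cells x genes).
    cluster_labels: one cluster label per cell.
    Assumption: a gene is 'detected' in a cell if its count is > 0.
    """
    cells_by_cluster = defaultdict(list)
    for row, label in zip(counts, cluster_labels):
        cells_by_cluster[label].append(row)

    kept = []
    for g, name in enumerate(gene_names):
        for rows in cells_by_cluster.values():
            detected = sum(1 for row in rows if row[g] > 0)
            if detected / len(rows) >= min_fraction:
                kept.append(name)
                break  # one qualifying cluster is sufficient
    return kept
```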

To calculate pairwise cluster correlations (Pearson correlations, 
Extended Data Fig. 5b), we used this set of 143 genes and average cluster 
expression data (calculated from normalized and log-transformed data 
with the AverageExpression function in the Seurat package). A distance 
matrix was calculated from the correlation matrix, and used for 
hierarchical clustering (R function hclust) with the Ward.D2 linkage method. 

The gene expression matrix from above was transposed to calculate 
gene-gene correlations. The gene dendrogram was also calculated with 
hierarchical clustering and the Ward.D2 linkage method. 
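
As a rough illustration of the correlation step, the Python sketch below computes pairwise Pearson correlations between expression profiles and converts them to distances. The conversion d = 1 - r is an assumption, because the text does not specify how the distance matrix was derived from the correlation matrix; the Ward linkage itself is not reimplemented here, and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(profiles):
    """Pairwise distance matrix, assuming distance = 1 - Pearson r."""
    k = len(profiles)
    return [[1.0 - pearson(profiles[i], profiles[j]) for j in range(k)]
            for i in range(k)]


def transpose(matrix):
    """Swap rows and columns, e.g. cluster profiles -> gene profiles."""
    return [list(col) for col in zip(*matrix)]
```

Passing one average-expression vector per cluster gives cluster-cluster distances; passing transpose(...) of the same matrix gives gene-gene distances, mirroring the transposition described above.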


The heat map in Fig. 3h was generated from the matrix of 29 gluta- 
matergic clusters (columns) and average expression of the 143 genes 
(rows). The data matrix was scaled by columns, and the heat map was 
plotted with the heatmap.2 function from the R package gplots. The 
dendrogram of glutamatergic clusters is based on Euclidean distance 
and Ward.D2 linkage. 


Mapping of single-cell transcriptomes across species 

To map Pogona single-cell transcriptomes on mouse single-cell data, 
we used the dataset from a previous study, which is available on the 
dropviz.org website. In this dataset, pallial glutamatergic neurons 
were sampled from three regions: ‘hippocampus’, ‘frontal cortex’ and 
‘posterior cortex’. These dissections encompass several cell types. For 
example, ‘frontal cortex’ includes the claustrum, and ‘hippocampus’ 
includes the subiculum and entorhinal cortex. Raw data were processed 
through the Seurat pipeline (normalization, scaling, selection of 
variable genes) and glutamatergic clusters and subclusters were selected, 
according to the cluster and subcluster identities provided previously 
(see the previous study and dropviz.org). Subclusters were downsampled to a maximum 
number of 200 cells per subcluster, yielding a total of 17,455 cells. 
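
The downsampling step can be sketched in Python as follows. The random seed, the function name and the use of uniform sampling without replacement are assumptions; the text states only the cap of 200 cells per subcluster.

```python
import random
from collections import defaultdict


def downsample_by_group(cell_ids, group_labels, max_cells=200, seed=0):
    """Keep at most max_cells randomly chosen cells per group.

    Groups at or below the cap are kept whole; larger groups are
    sampled uniformly without replacement (an illustrative choice).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for cell_id, label in zip(cell_ids, group_labels):
        groups[label].append(cell_id)

    kept = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) <= max_cells:
            kept.extend(members)
        else:
            kept.extend(rng.sample(members, max_cells))
    return kept
```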

Comparative analysis of Pogona and mouse was limited to one-to-one 
orthologues, according to the orthology annotations provided 
by Ensembl (Pogona assembly pv1.1 and mouse assembly GRCm38.p6, 
one-to-one orthologues downloaded on 1 May 2019). Of 13,273 
one-to-one orthologues, 10,693 were detected in both the mouse and 
Pogona datasets and used for the comparative analysis. 

The Pogona and mouse data were analysed jointly following the 
approach described previously. In brief, after normalization and 
scaling, 1,500 highly variable genes were identified in each dataset. 
The union of these sets of variable genes was used for a joint canonical 
correlation analysis (CCA). The first 15 canonical components were 
then used to identify 2,626 transfer anchors; that is, pairs of cells with 
matching neighbourhoods (mutual nearest neighbours) in the two 
transcriptomics spaces (function FindTransferAnchors from Seurat). 
These anchors were then used to project Pogona cells (query dataset) 
on the mouse dataset (reference dataset), using the TransferData 
function from Seurat. The projection is based on a weighted classifier that 
assigns a classification score on the basis of the distance of each cell 
from the transfer anchors. Figure 3g represents the result of the 
classification, showing the fraction of single cells from each Pogona cluster 
that map to each of the mouse subclusters (mouse subclusters without 
matching lizard cells are not indicated in the figure). 
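
The notion of mutual nearest neighbours that underlies the transfer anchors can be illustrated with a minimal Python sketch. This is not the Seurat implementation, which operates in the CCA space and applies further anchor filtering and scoring; the Euclidean metric, the value of k and the function names are assumptions for illustration.

```python
import math


def k_nearest(query, reference, k):
    """For each query point, the set of indices of its k closest
    reference points (Euclidean distance)."""
    result = []
    for q in query:
        order = sorted(range(len(reference)),
                       key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result


def mutual_nearest_neighbours(a, b, k=5):
    """Pairs (i, j) such that b[j] is among the k nearest neighbours of
    a[i] in b, and a[i] is among the k nearest neighbours of b[j] in a."""
    a_to_b = k_nearest(a, b, k)
    b_to_a = k_nearest(b, a, k)
    return [(i, j)
            for i in range(len(a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]
```

In this simplified picture, a would hold the query (Pogona) cells and b the reference (mouse) cells, each represented by its coordinates in the shared low-dimensional space.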

The approach described above was also used to project the 
transcriptomes of turtle pallial glutamatergic cells onto the Pogona data 
(Extended Data Fig. 7a). The turtle data are from a previous study. The 
comparison was based on 9,820 one-to-one orthologues detected in 
both species. For this analysis, the top 2,000 variable genes of each 
dataset were used for CCA. The first 25 canonical components were 
used to compute 3,406 transfer anchors. 


Identification of Pogona brain areas with a potential role in 
brain-state regulation 

Areas known to play a part in controlling brain state have been, over 
the past decades, identified in a number of mammalian species. Those 
areas can be identified by their location (for example, within the 
hypothalamus, midbrain or brainstem), their axonal projections and the 
neuroactive substances that their neurons contain and release (and 
thus potential marker genes). To our knowledge, no such description 
exists at present for the brain of the bearded dragon (Pogona), but 
anatomical studies of homologous areas have been performed in other 
reptilian species. These references were used to identify relevant 
brain areas, including preoptic area, supramammillary nucleus and 
tuberomammillary nucleus in the hypothalamus; ventral tegmental 
area, substantia nigra and periaqueductal grey in the midbrain; and 
lateral dorsal tegmental nucleus, locus coeruleus, subcoeruleus and 
raphe nucleus in the brainstem. The location and identity of these areas 
were established in Pogona by immunohistochemistry and/or FISH 
using appropriate neuronal markers, combined with Nissl stains of 
brain sections. Tyrosine hydroxylase (a marker of catecholaminergic 
neurons) was used to identify preoptic area, ventral tegmental area, 
substantia nigra, periaqueductal grey and locus coeruleus; choline 
acetyltransferase was used to identify lateral dorsal tegmental nucleus; 
histamine was used to identify tuberomammillary nucleus; serotonin 
was used to identify raphe nucleus; and subcoeruleus identification was 
based on the prior identification of lateral dorsal tegmental nucleus and 
locus coeruleus and on the expression of slc17a6 (vesicular glutamate 
transporter 2; a marker of glutamatergic neurons) by in situ 
hybridization (Extended Data Fig. 6a). The expression of slc17a6 by in situ 
hybridization was also used for the identification of supramammillary 
nucleus (Extended Data Fig. 6a). 


Pogona whole-brain images 

Pogona brain reconstruction (Fig. 3i) was based on images obtained 
with a µCT scanner, and the ‘surface’ function of the Imaris software 
(Oxford Instruments). The boundaries of relevant nuclei were deter- 
mined from consecutive serial histological sections. The serial images 
were aligned and assembled to three-dimensional (3D) volumes using 
the Voloom software, and then imported into Imaris and aligned with 
the 3D data. The boundaries of some areas identified by retrograde 
tracing were defined from GFP and Nissl staining patterns. 


Immunohistochemistry and in situ hybridization 

The lizards were deeply anaesthetized with isoflurane, ketamine (60 
mg kg⁻¹) and midazolam (2 mg kg⁻¹) until loss of the foot-withdrawal 
reflex. Pentobarbital (10 mg kg⁻¹) was then administered by 
intraperitoneal injection. After loss of the corneal reflex, the lizard was 
perfused transcardially with cold PBS (1.47 × 10⁻³ M KH₂PO₄, 8.10 × 10⁻³ M 
Na₂HPO₄·12H₂O, 2.68 × 10⁻³ M KCl, 1.37 × 10⁻¹ M NaCl) followed by 4% 
paraformaldehyde (PFA) in PBS. The brain samples were post-fixed 
with 4% PFA-PBS for 16 h at 4 °C and subsequently immersed in 30% 
sucrose for 24 h at 4 °C. The brain area was sectioned coronally (60 µm) 
with a microtome at −24 °C. The sections were permeabilized for 30 
min at room temperature in blocking solution (PBST: PBS with 0.3% 
Triton X-100 and 10% goat serum) and incubated with primary 
antibodies (anti-GFP, A10262, Invitrogen, chicken, 1:1,000; hippocalcin, 
ab24560, Abcam, rabbit, 1:1,000; ChAT, AB144P, Merck, goat, 1:100; 
mTH, 22941, ImmunoStar, mouse, 1:100; rabTH, AB152, Merck, rabbit, 
1:200; histamine, 22939, ImmunoStar, rabbit, 1:100; serotonin, MAB352, 
Merck, rat, 1:100) in blocking solution overnight at 4 °C. After washing 
with PBST three times, the samples were incubated with appropriate 
conjugated secondary antibodies (1:500, 
all from Invitrogen) in blocking solution for 4 h at room temperature, 
followed by three washes with PBST. Some slices were counterstained 
with NeuroTrace 435/455 blue-fluorescent Nissl stain (N21479, 
Invitrogen, 1:200) in PBS for 2 h at room temperature. After rinsing with 
PBS, the samples were mounted with Dako Fluorescence Mounting 
Medium (83023, Dako) or Roti-Mount FluorCare DAPI (HP20.1, Carl 
Roth). Images were acquired using a confocal system or fluorescence 
microscopy at 10×, 20× or 40×. Chromogenic in situ hybridization and 
dual colorimetric in situ hybridization were performed following the 
protocols previously described. 


Fluorescent in situ hybridization by RNAscope 

The lizards were deeply anaesthetized as described above. After loss 
of corneal reflex, the animals were killed by decapitation. Brains were 
dissected out immediately, embedded in OCT on a dry-ice ethanol bath 
and stored at −80 °C. Fresh frozen brains were sectioned at 25 µm on 
a Thermo Fisher Scientific CryoStar NX70 cryostat and placed onto 
SuperFrost-coated (Thermo Fisher Scientific) slides. Some slides 
were stored at −80 °C after air-drying. RNAscope hybridization was 
performed according to the manufacturer’s instructions. We used the 
RNAscope Multiplex Fluorescent assay (Advanced Cell Diagnostics) 
for fresh-frozen sections. Target genes and probe catalogue numbers 
were Pv-CHAT-C2, 522631-C2; Pv-SLC17A6-C1, 529431-C1. Fluorescent 
Nissl was used for counterstaining. Slides were mounted with ProLong 
Gold Diamond Antifade Mountant (P36970, Thermo Fisher Scientific). 
Images were acquired with a digital slide scanner (Pannoramic MIDI II, 
3DHISTECH) at 20× magnification. 


Tract tracing 
The lizards were anaesthetized as described for in vivo recordings. 
Extensive preliminary searches for useful AAV serotypes for 
reptilian brains and for appropriate incubation conditions were carried 
out by L. Pammer. The tracers (rAAV2-retro-CAG-GFP, 37825-AAVrg; 
rAAV2-retro-hSyn-EGFP, 50465-AAVrg; AAV9-CB7.Cl.mCherry.WPRE. 
RBG, 105544-AAV9; all from Addgene, https://www.addgene.org) 
were injected into one or two forebrain locations (for example, dorso- 
medial cortex, DVR, amDVR, and so on). Four to six weeks later, the 
animals were deeply anaesthetized as described above, and after loss 
of corneal reflex, the animals were killed by decapitation. Brains were 
dissected out, processed for histology, sectioned and imaged. The data 
presented come from 18 of 30 injected brains. The remaining 12 brains 
were rejected either because the viral injections failed or because the 
injections were not sufficiently specific. Targeting specific regions in the 
brain of Pogona and Trachemys is difficult because the brain is loosely 
contained in the cranial cavity and its position relative to the cranium 
and reliable landmarks is thus variable: the brain floats in CSF, attached 
by cranial nerves. As a consequence, there are no reliable stereotactic 
coordinates based on cranium landmarks. The lateral ventricles are 
large. The external appearance of the forebrain also lacks reliable land- 
marks (for example, blood vessels or sulci). Finally, these animals are 
not standardized species, bred over generations to reduce variability. 
Note that, because rAAV2-retro does not infect all neuronal types 
equally, the results from negative retrograde labelling should be 
confirmed with other methods. Conversely, the connectivity estimated 
with the tracers we used is likely to be underestimated. 


Statistics and reproducibility 

Unless stated otherwise, data are mean ± s.e.m. For comparisons of two 
groups we performed a two-tailed unpaired t-test, two-tailed paired 
t-test, Mann-Whitney rank-sum test or Wilcoxon signed-rank test, as 
appropriate (all two-sided). For multiple comparisons we performed 
a Bonferroni test. Significance was determined at the 0.05 α level for 
all statistical tests. For box plots (Fig. 4d): margins are 25th and 75th 
percentiles; red, median; whiskers, boundaries before outliers; outliers 
(+) are values beyond 1.5× interquartile range from the box margins. 
Experiments were repeated independently several times with similar 
results, with numbers of repetitions as follows: Fig. 1b–e: 7 times; 
Fig. 2a–c: 4 times; Fig. 2e: 13 times (amDVR) and 9 times (plDVR); Fig. 3: 
4 times (a, b, d–h) and 10 times (c); Fig. 4a–c: 3 times; Extended Data 
Fig. 1a–d: 7 times; Extended Data Fig. 1h: 3 times; Extended Data Fig. 2b: 
15 times; Extended Data Fig. 2g, i: 12 times; Extended Data Fig. 2h, j: twice; 
Extended Data Fig. 3f: 3 times; Extended Data Fig. 4a: 3 times; Extended 
Data Fig. 4b: 13 times (amDVR) and 9 times (plDVR); Extended Data 
Fig. 6a–c: 3 times (for all except for c5–7, for which experiments were 
reproduced once in 5 experiments) (see Fig. 3 legend); Extended Data 
Fig. 7b: 3 times; Extended Data Fig. 7c: 5 times; Extended Data Fig. 7d: 
4 times; Extended Data Fig. 7e, g: 3 times; Extended Data Fig. 8a, e, f: 4 
times; Extended Data Fig. 8b–d: 4 times; Extended Data Fig. 9a–d: twice 
(a, b) and 3 times (c) (claustrum lesions (d) were confirmed in all these 
experiments); Extended Data Fig. 10a, b: twice (a) and 3-4 times (b). 
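
Two of the summary conventions above (mean ± s.e.m. and Bonferroni correction at the 0.05 α level) are simple enough to state as a short Python sketch. It is given only as an illustration and is not the analysis code used for the paper; the function names are hypothetical, and the s.e.m. is assumed to be the sample standard deviation divided by √n.

```python
import math
import statistics


def mean_sem(values):
    """Mean and standard error of the mean (sample s.d. / sqrt(n))."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem


def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjusted P values (capped at 1) and significance flags.

    Each P value is multiplied by the number of comparisons and the
    adjusted value is compared with alpha.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p_adj < alpha for p_adj in adjusted]
```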


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Sequencing data have been deposited in the NCBI Sequence Read 
Archive: BioProjects PRJNA591493 (lizard) and PRJNA408230 (turtle). 
Links to those archives and to analysis code can be found at: 
https://brain.mpg.de/research/laurent-department/software-techniques.html. 
Data are also available from the corresponding author on request. 

Extended Data Fig. 1 | Further description of SWR statistics and propagation 
in vivo. a, The amplitude and frequency of sharp waves vary as the animal 
transitions between slow-wave and REM sleep. Top, illustrative LFP trace 
(<150 Hz) showing a decrease in sharp-wave amplitude and frequency around 
the slow-wave-REM transition point. Open circles indicate detected sharp 
waves (see Methods). Data in a-d are from the same animal and a single night, 
and correspond to the recording in Fig. 1 (anterior recording site, red). 
Statistics are based on n = 11,123 sharp waves. b, Distribution of sharp-wave 
width (measured at half peak amplitude) and peak amplitude from the animal in 
a and Fig. 1. Pr, probability. c, Average sharp-wave trace ± 1 s.d. (grey) calculated 
over n = 11,123 sharp waves. d, Inter-event interval (IEI) for sharp waves 
recorded during slow-wave sleep. The y axis (probability) is on a logarithmic 
scale. e, f, Summary of data recorded over five nights from two animals. Each 
circle represents the mean of one night; black line shows the median. e, Mean 
inter-event intervals during slow-wave sleep. f, Mean sharp-wave width and 
amplitude (n = 8,055-13,494 sharp waves per night). g, Delay distributions of 
sharp waves in anterior (or posterior) DVR, triggered on simultaneously 
recorded posterior (or anterior) DVR. Sharp waves from three nights (animal 1; 
n = 24,501 sharp waves) and two nights (animal 2; n = 13,070 sharp waves). 
h, Locations of simultaneous recording sites in the aDVR (circles). Left, 
schematic of recording configuration. Middle and right, confocal images 
highlighting the recording sites, as identified by electrolytic lesions and DiI dye 
that was applied to the back of the silicon probes. Post hoc staining with an 
antibody against hippocalcin was used to determine the borders of the 
claustrum (see Fig. 3). 

Extended Data Fig. 2 | Comparison of SWR statistics across preparations and 
recording conditions. a, Slice preparation (see Methods) for field-potential 
recordings. b, Spontaneous sharp waves (LFP; <150 Hz) and corresponding 
ripples (high-pass (HP) band; 70-150 Hz) in the amDVR. Insets: top left, 
magnification of the SWR marked with a dotted box; top right, 350 ripples; 
high-pass signal intensity (HPI) >70 Hz aligned on trough of sharp wave 
(overlaid as average). c, Distribution of amplitude (x) and width (y, full width at 
half maximum) of SWR events in a representative DVR slice. d, Distribution of 
SWR amplitude and width (as in c) in a representative ex vivo preparation. 
e, Ratio of amplitude (µV) to width (ms); n = 5 sleep epochs from 3 animals 
(in vivo; blue), 4 ex vivo brains (red) and 12 slices (green). Lines show the mean. 
f, Autocorrelation function of sharp-wave times, showing that the 
characteristic rhythmic modulation of sharp-wave generation (which is due to 
the alternation of slow-wave sleep and REM sleep with a 2-3 min period) in 
sleeping animals is absent from both ex vivo brain preparations and slice 
preparations (n = 5 sleep epochs from 3 animals (in vivo), 4 ex vivo brains and 12 
slices). g, Whole-cell patch-clamp recording (in current-clamp mode) of aDVR 
neuron (Vₘ), together with LFP recording in a neighbouring region (V(LFP)) 
with a glass micropipette. Note the simultaneous depolarization of the neuron 
and SWRs, and moderate neuronal depolarization that gives rise to occasional 
firing (three action potentials here). The experiment was repeated with 12 
neurons. h, Whole-cell patch-clamp recording of an amDVR neuron in 
voltage-clamp mode, held at depolarized (blue) and hyperpolarized (red) holding 
potentials (Vₕ). Note the volleys of excitatory (red) and inhibitory (blue) 
currents at each SWR (LFP), and the near absence of synaptic input in between. 
i, Spike times of a patched amDVR neuron in relation to sharp waves. Note the 
locking to the sharp-wave trough (t = 0), and the absence of firing otherwise 
(n = 2 amDVR neurons). j, Mean excitatory (gₑ) and inhibitory (gᵢ) conductances 
(n = 20 and 21 events, respectively). The black and grey lines show averaged 
sharp waves recorded with inhibitory and excitatory conductances, 
respectively. Traces are aligned on the sharp-wave trough. 

Extended Data Fig. 3 | Additional single-cell transcriptomic 
characterization. a, UMAP representation of 20,257 Pogona telencephalic 
cells, colour-coded by cluster. EG, ependymoglial cells; ExcNeur, excitatory 
neurons; InhNeur, inhibitory neurons; MG, microglia; mur, mural cells; NPC, 
neural progenitor cells; olig, oligodendrocytes; OPC, oligodendrocyte 
progenitor cells; prol, proliferating cells; RBC, red blood cells. b, Dot plot 
showing the expression of canonical cell markers (rows) across telencephalic 
cell clusters (columns). The size of the dot corresponds to the percentage of 
cells in a cluster in which the gene has been detected, and the colour represents 
the expression level. c, UMAP representation of 9,777 lizard telencephalic 
neurons, colour-coded by cluster. d, e, UMAP representations of glutamatergic 
(slc17a7) and GABAergic (slc32a1) neurons in the telencephalon dataset. 
f, Double colorimetric in situ hybridization in a frontal section through the 
anterior Pogona forebrain. Scale bar, 1 mm. slc32a1 (blue) labels GABAergic 
neurons in the subpallium and scattered GABAergic neurons that have 
migrated from subpallium to pallium. slc17a6 (orange) labels glutamatergic 
neurons in the pallial region. g, Ordered matrix of pairwise Pearson 
correlations between the expression of 143 ion-channel and 
neurotransmitter-receptor genes detected in this glutamatergic pallial dataset from Pogona (see 
Extended Data Fig. 5). The dendrogram (top) is based on correlation 
coefficients and Ward.D2 linkage; red indicates a gene module with enriched 
expression in the amDVR. h, Average expression, in the 29 glutamatergic 
Pogona clusters, of the 143 genes in g (and Extended Data Fig. 5). Genes with 
enriched expression in the amDVR are listed on the right, with relevant 
neurotransmitter receptor genes in bold. i, UMAP representation of 4,054 
lizard pallial glutamatergic neurons, colour-coded by cluster (same as in 
Fig. 3a). j, Dot plot showing the expression of specific cluster markers (rows) in 
the 29 pallial glutamatergic clusters (columns). The size of the dot corresponds 
to the percentage of cells in a cluster in which the gene has been detected, and 
the colour represents the expression level. 

Extended Data Fig. 4 | Mini-slices of the DVR and localization of SWR 
generation. a, Left, recording configuration of mini-slices of the DVR on a 
planar 252-channel microelectrode array. Dots represent electrodes. Right, 
post hoc immunostaining of the mini-slices. Red, Nissl; green, hippocalcin. 
b, Left, spatial distribution of SWR waveforms as recorded from the mini-slices 
in a. Right, illustrative LFP traces recorded from the amDVR or claustrum (1) 
and plDVR (2) (see recording positions on the microelectrode array on the left). 
In conclusion, SWRs occur spontaneously in the amDVR, and are absent from 
the plDVR once it is disconnected from the amDVR (claustrum). 

Extended Data Fig. 5 | Ion-channel and neurotransmitter-receptor mRNAs 
in the glutamatergic cell clusters of the Pogona telencephalon. a, Dot plot 
showing expression of ion-channel and neurotransmitter-receptor genes 
(rows) in Pogona glutamatergic clusters (columns 1-29). The plot shows only 
genes that were detected in at least 20% of the cells of at least one cluster. The 
size of the dot corresponds to the percentage of cells in a cluster in which the 
gene has been detected, and the colour represents the expression level. 
Clusters 19 and 20 (box) correspond to the amDVR or claustrum. They differ by 
the expression of some acetylcholine- and serotonin-receptor subtypes (see 
also Fig. 3h). b, Ordered pairwise Pearson correlation matrix of cluster 
transcriptomes, calculated from the expression of the ion-channel and 
neurotransmitter-receptor genes in a. This gene set is sufficient to distinguish 
the amDVR clusters (19 and 20) from all of the others. The dendrogram is based 
on Pearson correlations and Ward.D2 linkage. 


Extended Data Fig. 6 | Identification of potential regulatory areas of brain 
states and distribution of GFP-labelled neurons after injection of 
rAAV2-retro into the claustrum. a, Left, schematic of the Pogona brain in sagittal 
view, showing the regions defined by immunohistochemistry, in situ 
hybridization and retrograde tracing. Numbers 1-7 indicate the levels of the 
transverse sections that are shown on the right. Right (panels 1-7), micrographs 
and corresponding schematic representations of relevant areas (in red), 
identified by immunohistochemistry, in situ hybridization and Nissl staining. 
Scale bars, 500 µm. Far right of panels 1-7, magnified views of area(s) 
delineated as box(es) in the corresponding photomicrographs. Scale bars, 
100 µm. b, Identification of rAAV2-retro injection sites. Scale bars, 500 µm. The 
red channel is not shown in the rightmost image. c, Illustrative examples of 
retrograde labelling of claustrum connectivity, in transverse sections. Panels 1, 
2, inputs to claustrum revealed by rAAV2-retro injection in the claustrum. Panel 
1, injection site in lateral claustrum. The claustrum is indicated by 
anti-hippocalcin immunostain (pink). Note retro-labelled cells in the anterior dorsal 
cortex (box, magnified at right). Panel 2, same brain as in 1, but a more posterior 
section. The labelled region in the box is the dorsal lateral amygdala. Panels 
3-12, representative images illustrating the distribution of GFP-labelled 
neurons in the DLPT, DLT, DMT, prethalamus, SUM, mammillary nucleus (MN), 
TMN, VTA, SN, PAG, LoC and SC, with projections to the claustrum. 
Abbreviations as in Fig. 3. The catecholaminergic neuron marker tyrosine 
hydroxylase (TH) was used to indicate the location of the VTA, SN and LoC. 
Scale bars, 500 µm. Scale bars for magnified areas: DLPT, DLT, DMT, 
prethalamus, SUM, MN, TMN, VTA, LoC, 50 µm; SN, PAG, SC, 100 µm. 

Extended Data Fig. 7 | The claustrum of lizard and turtle differ in position 
and architectonics, but are both autonomous sources of SWRs. 
a, Transcriptomic similarity between turtle and lizard clusters, measured as the 
fraction of single cells that mapped from the turtle pallium dataset to the 
Pogona clusters (Methods). Note that the turtle cell clusters e03-e06 (pallial 
thickening; PT) map to the lizard cluster 19 (amDVR or claustrum). Turtle data 
and clusters are from a previous study. b, In situ hybridization in an anterior 
transverse section, showing expression of the pallial thickening marker gene 
crhbp. Scale bar, 500 µm. c, Architectonics of the lizard claustrum. Right, 
retrograde labelling of claustrum neurons by rAAV2-retro injected into the 
aDVR. Left, magnification of the boxed area on the right (in the claustrum). 
Note the disordered distribution of multipolar neurons. Pink colour shows 
anti-hippocalcin immunostaining. Scale bars, 100 µm (left); 500 µm (right). 
d, Architectonics of the turtle claustrum. Right, retrograde labelling of 
claustrum neurons by rAAV2-retro injected into the dorso-medial cortex. Left, 
magnification of the boxed area on the right. Note the arrangement of bipolar 
neurons within the pallial thickening layer (see also b for layering of pallial 
thickening). Scale bars, 100 µm (left); 500 µm (right). e, Spontaneous sharp 
waves recorded simultaneously in the claustrum and the DVR in turtle slice 
preparation. The red dots in the schematic indicate recording sites. Note sharp 
wave (LFP) and ripple in the high-pass (HP) band. f, Bottom, 295 successive 
spontaneous ripples (high-pass signal intensity (HPI) > 70 Hz) aligned on the 
trough of each sharp wave. Top, average of 295 sharp waves aligned on 
waveform troughs. Grey shading represents s.d. g, Representative 
cross-correlogram of LFP traces recorded simultaneously from the claustrum and 
the DVR (with claustrum as reference), showing the sharp waves from the DVR 
trailing those from the claustrum. 
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Extended Data Fig. 8 |SWRrecordings and stimulation experiments with 
lizard ex vivo brain preparations. a-f, Experiments in ex vivo brain 
preparation after cortex removal.a, Top, ex vivo brain preparation. Bottom, 
spontaneous SWRs recorded in the claustrum (<150 Hz) (top trace); HP: 70-150- 
Hz filtered LFP, showing ripples (bottom trace). b, Local pressure injection of 
20 uM TTX into the claustrum and post hoc assessment of injection with Evans 
blue (transverse section at the bottom, red).c, Injection of TTX into the 
claustrum (shading) silences sharp-wave activity in the claustrum, but also 
(indirectly) inthe DVR. d, Analysis of four experiments asinc. The filled circles 
represent mean ± s.e.m. Claustrum: *P = 0.029, T = 26 (two-sided Mann-Whitney
rank-sum test); DVR: *P = 0.029, T = 26 (two-sided Mann-Whitney rank-sum
test). e, Top, average trace and s.d. (shading) from 3,842 sharp waves
recorded from the claustrum of an ex vivo forebrain (alignment on trough).
Bottom, HPI (>70 Hz) aligned on sharp-wave trough, showing ripple alignment.
f, Top, simultaneous recordings from ipsilateral claustrum and DVR in an ex
vivo preparation. Bottom, cross-correlation between simultaneous recordings
in ipsilateral claustrum and DVR, showing that the claustrum precedes the DVR
by around 100 ms. g, Peristimulus time histogram for multi-unit activity in the
cortex, in response to activation of ipsilateral claustrum in an intact ex vivo
forebrain. The experiment was carried out in normal ACSF at room
temperature in the presence of 30 µM serotonin to suppress spontaneous
SWRs in the claustrum and 50 µM carbachol to raise cortex excitability. The
claustrum stimulus consisted of a single 50-µs electrical pulse, delivered with a
bipolar electrode. Cortex multi-unit activity was recorded with a glass
micropipette. h, Change in cortical firing rate (FR) measured in a 200-ms bin
after the claustrum stimulus versus a 200-ms bin before the stimulus (as in g,
on each side of t = 0). The control column plots the firing-rate ratio measured in
the experiment in g, and the GBZ + CGP column plots the results of the same
experiment after addition of the GABA receptor antagonists gabazine (GBZ;
5 µM) and CGP52432 (CGP; 2 µM); n = 4 ex vivo brains from 3 animals each. The
control experiment shows that stimulation of the claustrum has an immediate
and reliable inhibitory effect on the cortex (#: significantly different from
baseline, P = 0.017, t₃ = 4.8 (two-sided paired t-test)). The stimulation
experiment in GABA receptor antagonists shows that stimulation of the
claustrum now slightly excites the cortex (**: significantly different from
control, P = 2.0 × 10⁻³, t₆ = −5.22 (two-sided Student’s t-test)), suggesting that
projections from the claustrum both activate and inhibit cortical neurons,
probably via direct excitatory projections and indirect inhibitory ones through
interneurons (see rodent experiments in a previous study). Short horizontal
lines indicate mean.
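The firing-rate ratio plotted in panel h can be illustrated with a short sketch. It assumes spike times expressed in milliseconds relative to the stimulus at t = 0 and two equal 200-ms bins, as the legend describes; the function name and the example spike times are hypothetical and are not data from the study.

```python
def firing_rate_ratio(spike_times_ms, stim_time_ms=0.0, bin_ms=200.0):
    """Ratio of post-stimulus to pre-stimulus spike counts.

    Because both bins have the same width, the ratio of counts equals
    the ratio of firing rates.
    """
    before = sum(1 for t in spike_times_ms
                 if stim_time_ms - bin_ms <= t < stim_time_ms)
    after = sum(1 for t in spike_times_ms
                if stim_time_ms <= t < stim_time_ms + bin_ms)
    if before == 0:
        raise ValueError("no spikes in the pre-stimulus bin; ratio undefined")
    return after / before


# Hypothetical multi-unit spike times (ms relative to the stimulus).
spikes = [-180, -150, -90, -40, -10, 30, 160]
ratio = firing_rate_ratio(spikes)
```

A ratio below 1 corresponds to net inhibition after the stimulus, and a ratio above 1 to net excitation.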
Extended Data Fig. 9 | Further analysis of in vivo ibotenic-acid-induced
lesion experiments in sleeping Pogona. a, Autocorrelation (top) and
cross-correlation (bottom) of β-band activity in the left and right DVR during sleep in
an animal with bilateral claustrum lesions (lesions are shown in d). Note that a
periodic sleep rhythm (period of around 3 min here) remains after claustrum
lesions and therefore does not seem to depend on claustrum integrity. b, c,
Same as a, but with unilateral ibotenic-acid-induced lesion in two animals (I and
II). The non-lesioned (sham) side was injected with the same volume of PBS
vehicle but without ibotenic acid. Dotted line, sham; solid line, lesion. d, Nissl
stains (1-3) of transverse sections of the brain of animals with bilateral lesions
(shown also in Fig. 4b), at levels indicated in the schematic on the left. Note the
claustral lesions (arrows in 1), which are visible as cell-body loss, and the
recording sites in left (2) and right (3) DVRs (dotted circles).
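How a rhythm period can be read from an autocorrelation, as in panel a, is sketched below. This is a minimal illustration under stated assumptions: a regularly sampled signal, a normalized autocorrelation, and the period taken as the first positive local maximum after lag 0. The function names, the 10-s sampling step and the synthetic 180-s sine wave are hypothetical and are not the authors' analysis code.

```python
import math


def autocorrelation(x, max_lag):
    """Normalized autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d)
    return [sum(d[i] * d[i + lag] for i in range(n - lag)) / var
            for lag in range(max_lag + 1)]


def dominant_period(x, dt, max_lag):
    """Lag (in units of dt) of the first positive local maximum after lag 0."""
    ac = autocorrelation(x, max_lag)
    for lag in range(1, max_lag):
        if ac[lag] > 0 and ac[lag] > ac[lag - 1] and ac[lag] >= ac[lag + 1]:
            return lag * dt
    return None


# Synthetic band-power trace: one sample every 10 s, 180-s (3-min) rhythm.
dt = 10.0
signal = [math.sin(2 * math.pi * (i * dt) / 180.0) for i in range(200)]
period = dominant_period(signal, dt, max_lag=60)
```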
Extended Data Fig. 10 | Further data on serotonergic projections to
claustrum and their effects on the generation of sharp-wave ripples.
a, Transverse section of claustrum double-labelled with DAPI (blue, nuclei) and
serotonin (axonal fibres) antibodies. Note the dense meshwork of serotonergic
fibres. Scale bar, 50 µm. b, Frequency of spontaneous SWRs in claustrum
mini-slices as a function of superfused serotonin concentration. Red circles
represent individual experiments (slices). Black points and lines are
mean ± s.e.m.
Data collection In vivo data were collected using Cheetah (Neuralynx), and ex vivo and slice recordings were collected with pClamp 10.5 (Molecular
Devices), Patchmaster v2x90 (HEKA), BrainWave v.4.1 (3Brain) and MC_Rack (Multichannel Systems). For the brain reconstruction we
used a µCT scanner and Imaris software (Oxford Instruments); confocal images were taken with Zen 2.1 software (Carl Zeiss).


Data analysis Custom-written code (MATLAB 2016a-2017b) was used to analyze physiological data. Raw sequencing data were processed using Cellranger
v3.0 (10x Genomics), and digital gene expression matrices were analyzed in R, using the Seurat v3.0 package. Reconstructed brain images
were analyzed with Voloom 3.0 (microDimensions) and Imaris 9.2 (Oxford Instruments).
Life sciences study design


All studies must disclose on these points even when the disclosure is negative. 


Sample size No statistical tests were used to predetermine sample sizes. The number of brain samples used for each experiment
was chosen based on previous experience with this specific type of experiment and on commonly used sample sizes in this field of
research, taking into account the unusual nature and limited availability of the animal species studied.


Data exclusions Experiments with off-target placement of electrodes or viral infection were excluded from our analyses. For transcriptomic analyses, only cells 
were considered for which the number of detected genes was >800 genes/cell (> 1000 for glutamatergic neurons), and the percentage of 
mitochondrial genes <5%/cell, as detailed in the methods. The exclusion criteria were pre-established.
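The transcriptomic exclusion criteria stated above can be written as a simple per-cell filter. The sketch below is illustrative only: the study reports using Seurat in R, whereas this uses plain Python, and the function name, field names and example cells are hypothetical.

```python
def passes_qc(n_genes, pct_mito, is_glutamatergic=False):
    """Apply the stated thresholds: >800 detected genes per cell
    (>1000 for glutamatergic neurons) and <5% mitochondrial genes."""
    min_genes = 1000 if is_glutamatergic else 800
    return n_genes > min_genes and pct_mito < 5.0


# Hypothetical cells, for illustration only.
cells = [
    {"id": "c1", "n_genes": 950, "pct_mito": 2.0, "glut": False},
    {"id": "c2", "n_genes": 950, "pct_mito": 2.0, "glut": True},
    {"id": "c3", "n_genes": 1500, "pct_mito": 6.5, "glut": True},
    {"id": "c4", "n_genes": 1200, "pct_mito": 4.9, "glut": True},
]
kept = [c["id"] for c in cells
        if passes_qc(c["n_genes"], c["pct_mito"], c["glut"])]
```

Here c2 fails the stricter gene-count threshold for glutamatergic neurons and c3 fails the mitochondrial threshold.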


Replication We observed similar results which satisfied the same statistical criteria across experiments and we could replicate all our results. 


Randomization Animals were not assigned to groups, and were selected based on weight, health and temperament (for in vivo experiments). 
Randomization was not relevant for our study. 


Blinding Investigators were not blinded to group allocation during data collection and analysis. Our study was mostly observational in nature with the 
exception of pharmacological experiments (Fig. 6), comparing the effect of neuromodulators, ibotenic acid, and TTX injections on sleep EEG 
and SWR production. Measurements of the effects were fully automated, and blinding was thus not relevant for our study.


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems        Methods
n/a | Involved in the study             n/a | Involved in the study
Antibodies                              ChIP-seq
Eukaryotic cell lines                   Flow cytometry
Palaeontology                           MRI-based neuroimaging
Animals and other organisms
Human research participants
Clinical data


Antibodies 


Antibodies used Standard, commercially available antibodies were used. Primary antibodies were anti-GFP (A10262, Invitrogen, chicken, 1:1000);
anti-hippocalcin (ab24560, Abcam, rabbit, 1:1000); anti-ChAT (choline acetyltransferase; AB144P, Merck, goat, 1:100); anti-TH
(tyrosine hydroxylase; 22941, ImmunoStar, mouse, 1:100 or AB152, Merck, rabbit, 1:200); anti-histamine (22939, ImmunoStar,
rabbit, 1:100); and anti-serotonin (MAB352, Merck, rat, 1:100). Secondary antibodies were donkey or goat anti-rabbit, anti-chicken,
anti-goat, anti-mouse or anti-rat, conjugated with Alexa-488, 568 or 647 (A21206, A21208, A11039, A11011, A11057, A11004, A31573, A21247,
A31571, all from Invitrogen, all 1:500).


Validation The manufacturers validated the antibodies by western blot and dot blot. IHC validation was performed in our laboratory, testing
various concentrations on lizard tissue. 


Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals Lizards (Pogona vitticeps, either sex, adult (100-400 g)) were bred in-house or obtained from external breeders.
Turtles (Trachemys scripta elegans or Chrysemys picta, either sex, adult (200-400 g)) were obtained from an open-air breeding
colony (NASCO Biology, WI, USA), and all species used were housed in our state-of-the-art animal facility.
Wild animals This study did not involve wild animals.
Field-collected samples This study did not involve field-collected samples. 


Ethics oversight All experimental procedures were performed in accordance with German animal welfare guidelines: permit #V54- 19c 20/15- 
F126/1005 delivered by the Regierungspraesidium Darmstadt, Germany (Dr. E. Simon). 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 
ATP13A2 (PARK9) is a late endolysosomal transporter that is genetically implicated in
a spectrum of neurodegenerative disorders, including Kufor-Rakeb syndrome (a
parkinsonism with dementia¹) and early-onset Parkinson’s disease². ATP13A2 offers
protection against genetic and environmental risk factors of Parkinson’s disease,
whereas loss of ATP13A2 compromises lysosomes³. However, the transport function
of ATP13A2 in lysosomes remains unclear. Here we establish ATP13A2 as a lysosomal
polyamine exporter that shows the highest affinity for spermine among the 
polyamines examined. Polyamines stimulate the activity of purified ATP13A2, whereas 
ATP13A2 mutants that are implicated in disease are functionally impaired to a degree 
that correlates with the disease phenotype. ATP13A2 promotes the cellular uptake of 
polyamines by endocytosis and transports them into the cytosol, highlighting a role 
for endolysosomes in the uptake of polyamines into cells. At high concentrations 
polyamines induce cell toxicity, which is exacerbated by ATP13A2 loss due to 
lysosomal dysfunction, lysosomal rupture and cathepsin B activation. This phenotype 
is recapitulated in neurons and nematodes with impaired expression of ATP13A2 or its 
orthologues. We present defective lysosomal polyamine export as a mechanism for 
lysosome-dependent cell death that may be implicated in neurodegeneration, and 
shed light on the molecular identity of the mammalian polyamine transport system. 


ATP13A2 is a P5B-ATPase that belongs to the family of P-type ATPases,
which couple ATP hydrolysis to substrate transport while transiently
forming a catalytic phospho-intermediate⁴. ATP13A2 is generally
described as a heavy-metal transporter⁵, but Ca²⁺ (ref. 6) and the
polyamine spermidine (SPD)⁷,⁸ have also been proposed as potential
substrates. To screen for the transported substrate(s) of ATP13A2, we
measured ATPase activity in the presence of various candidate
substrates in solubilized microsomal membrane fractions of SH-SY5Y cells
that overexpress wild-type human ATP13A2 (hereafter denoted WT-OE)
or comparable levels of the catalytically dead mutant ATP13A2(D508N)
(containing an aspartic-acid-to-asparagine mutation at position 508;
denoted D508N-OE)⁹,¹⁰.

The ATPase activity of wild-type ATP13A2 was significantly stimulated
by the polyamines SPD and spermine (SPM) (Fig. 1a), whereas
SPM had no effect on the activity of the D508N mutant (Extended Data
Fig. 1a). MnCl₂, ZnCl₂, FeCl₃, CaCl₂, diamines, monoamines and amino
acids exerted no effect on activity (Extended Data Fig. 1a-d), but the
polyamines SPM, N¹-acetylspermine and SPD stimulated the ATPase
activity of ATP13A2 in a concentration-dependent manner (Fig. 1b,
Extended Data Fig. 1e), with the highest apparent affinity observed
for SPM (Extended Data Table 1).

The catalytic autophosphorylation and/or dephosphorylation reactions
of P-type ATPases occur in response to binding of the transported
substrate⁴. ATP13A2 forms a phospho-intermediate on the D508 residue
in the absence of SPM supplementation, whereas the addition of SPM
leads to a dose-dependent reduction in the levels of the ATP13A2
phosphoenzyme (Fig. 1c), which is not seen with ornithine (Extended Data
Fig. 1f). The dephosphorylation rate after a chase with non-radioactive 
ATP increased in the presence of SPM (Fig. 1d), which further indicates 
that SPM could be the transported substrate. 

SPM also stimulated the ATPase activity of purified human ATP13A2 
(for purification, see Extended Data Fig. 2a-e). However, purified
Fig. 1 | ATP13A2 is a polyamine transporter. a, Chemical structures of
ornithine (ORN), putrescine (PUT), spermidine (SPD) and spermine (SPM).
b-d, Measurements on solubilized SH-SY5Y microsomes overexpressing
ATP13A2 (WT-OE). b, Dose-response curves showing the effect of SPM, SPD,
PUT and ORN on the ATPase activity of ATP13A2. c, ATP13A2 phosphoenzyme
(EP) levels in WT-OE cells in the presence of increasing SPM concentrations.
Top, representative autoradiogram; bottom, quantification of EP. d, Pulse
([γ-³²P]ATP) chase (cold ATP) of dephosphorylation in the presence and
absence of 1 mM SPM in WT-OE microsomes. e, ATPase activity of purified
ATP13A2 under increasing concentrations of SPM (in the presence and absence
of phosphatidic acid (PA) and/or PtdIns(3,5)P₂). f, Dose-response curves
showing the effect of SPM, SPD and PUT on the ATPase activity of purified
ATP13A2 supplemented with phosphatidic acid and PtdIns(3,5)P₂ (as a
reference, SPM + phosphatidic acid + PtdIns(3,5)P₂ from e is also shown).
g, Pulse ([γ-³²P]ATP) chase (cold ATP) of dephosphorylation measured for


ATP13A2 presented similar properties to microsomal ATP13A2 only
in the presence of the regulatory lipids phosphatidylinositol(3,5)bisphosphate
(PtdIns(3,5)P₂) and phosphatidic acid, which bind to the N
terminus of ATP13A2 (Fig. 1e, f, Extended Data Fig. 2f, Extended Data
Table 1). SPM-induced ATPase activity was blocked by orthovanadate, a
general P-type ATPase inhibitor (Extended Data Fig. 1g). Finally, we also
purified the mutant ATP13A2(E343A), which carries a mutation in the
conserved catalytic motif for dephosphorylation (³⁴¹TGES) (Extended
Data Fig. 2g). The E343A mutant underwent autophosphorylation 
(Fig. 1g) but displayed limited SPM-induced ATPase activity (Extended 
Data Fig. 2h). Notably, when the phosphoenzyme was chased with cold 
ATP, SPM clearly stimulated dephosphorylation of purified wild-type 
ATP13A2, but not of the E343A mutant (Fig. 1g, Extended Data Fig. 1h, 6a). 


ATP13A2 is a lysosomal polyamine exporter


Next, we performed transport assays with ³H-labelled SPM (³H-SPM)
in reconstituted vesicles from solubilized yeast membranes expressing
biotin acceptor domain (BAD)-labelled wild-type ATP13A2 or the


420 | Nature | Vol 578 | 20 February 2020


purified ATP13A2 (wild-type or E343A mutant) in the presence or absence of 1
mM SPM. h, Illustration of vesicle reconstitution and the ³H-spermine (³H-SPM)
transport assay. i, Immunoblot of reconstituted vesicles. j, Uptake of ³H-SPM in
reconstituted vesicles derived from yeast overexpressing BAD-tagged
ATP13A2 (wild-type or E343A mutant) supplemented with phosphatidylcholine
and phosphatidic acid, in the presence or absence of intraluminal ATP and an
ATP-regenerating system. Data are presented as mean ± s.e.m. in b-g, or as box
and whisker plots with overlaid individual data points representing replicates
(j; horizontal line, median; box boundaries, 25th and 75th percentiles). The
number of independent biological experiments was as follows: n = 3 b (ORN,
SPD), d-g (E343A), i, j (wild-type ATP outside; E343A); n = 4 c, e (no lipid); n = 5 b
(PUT), g (wild-type); n = 6 j (wild-type no ATP), b (SPM); n = 8 j (wild-type
ATP inside). Analysis by two-way (g) or one-way (j) ANOVA with Tukey’s
corrections. Fitted lines indicate nonlinear allosteric sigmoidal (b, e, f) or
two-phase (c, d, g) decay. For gel source data, see Supplementary Fig. 1.
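The 'allosteric sigmoidal' fits named in the legend can be illustrated with a Hill-type rate equation. The sketch below assumes one common parameterization, v = Vmax·S^h / (K_half^h + S^h); the legend does not state the exact equation used, and the function name and parameter values here are hypothetical, not fitted values from the study.

```python
def allosteric_sigmoidal(s, vmax, k_half, h):
    """Hill-type activity at substrate concentration s.

    vmax   : maximal activity
    k_half : concentration giving half-maximal activity
    h      : Hill slope (h > 1 gives a sigmoidal curve)
    """
    if s < 0:
        raise ValueError("concentration must be non-negative")
    return vmax * s ** h / (k_half ** h + s ** h)


# Hypothetical parameters, for illustration only.
vmax, k_half, h = 100.0, 0.5, 1.8
curve = [allosteric_sigmoidal(s, vmax, k_half, h)
         for s in (0.0, 0.1, 0.5, 1.0, 5.0)]
```

By construction the activity at s = k_half is half of vmax, which is why k_half serves as an apparent-affinity measure when curves for different polyamines are compared.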


mutant ATP13A2(E343A), supplemented with the activating lipid 
phosphatidic acid. This reconstitution rendered two populations of 
ATP13A2 proteins that were inserted either right-side-out (ATP-binding 
domain in the extraluminal space) or inside-out (ATP-binding domain
in the lumen) (Fig. 1h, i). Uptake of ³H-SPM was detected for wild-type
ATP13A2, but not for the E343A mutant, only when ATP was present
inside the vesicles, together with an ATP-regenerating system (Fig. 1j);
that is, when ATP13A2 was positioned inside-out ((2) in Fig. 1h). Extending
these insights to the cellular context, in which ATP13A2 is present
in the late endolysosomal compartment, ATP13A2 most probably
operates as a lysosomal SPM exporter. 

Notably, the functionality of ATP13A2 affects the cellular polyamine 
content. We generated two independent ATP13A2-knockout SH-SY5Y
cell lines using CRISPR-Cas9 genome editing (denoted KO; Extended 
Data Fig. 3a) and, upon analysis by mass spectrometry, demonstrated 
that the total cellular polyamine content was lower in KO than in con- 
trol cells (Fig. 2a). Expression of either the wild-type ATP13A2 in the 
KO background (denoted KO/WT) for rescue, or the D508N mutant
(KO/D508N) as a negative control (Extended Data Fig. 3b), resulted in
Fig. 2 | ATP13A2 transport affects cellular polyamine uptake, which is
impaired by catalytic and disease mutations. a, b, Metabolomics of cellular
polyamines in ATP13A2 knockout cells (KO) compared with SH-SY5Y controls
(a), or compared with rescue cell lines expressing wild-type ATP13A2 (KO/WT)
or ATP13A2(D508N) (KO/D508N) (b). c, Uptake of BODIPY-SPM in the presence
or absence of the endocytosis inhibitors Dynasore, genistein and/or Pitstop 2,
alone or in combination (combo). d, Confocal microscopy of BODIPY-SPM
distribution (see Methods). Scale bar, 5 µm. The arrowhead shows the region
that is expanded in the insets; the dashed line shows the region analysed in
Extended Data Fig. 1j. e, f, Immunoblotting (e) and SPM-induced ATPase
activity (f) of microsomes expressing wild-type ATP13A2, the D508N mutant,
or catalytic mutants in M4 (A467V), M6 (D962N) and M8 (K1062A).
g, Representative autoradiogram depicts phosphoenzymes of wild-type
ATP13A2 compared with the indicated mutants in the presence or absence of
1 mM SPM. h, Uptake of BODIPY-SPM in cells expressing the indicated mutant
proteins. i, j, Immunoblotting (i) or SPM-induced ATPase activity (j) of SH-SY5Y
microsomes overexpressing mutants associated with Kufor-Rakeb syndrome
(T512I and G872R) or early-onset Parkinson’s disease (T12M, G528R, A741T).


a significantly higher SPD and SPM content in KO/WT cells than in KO
or KO/D508N cells (Fig. 2b).

Flow cytometry experiments revealed that ATP13A2 promotes the
cellular uptake of BODIPY-labelled polyamines. WT-OE cells took up
more BODIPY-SPD and BODIPY-SPM than D508N-OE cells or control
cells that expressed firefly luciferase (Fluc; Extended Data Fig. 4a, b).
Similarly, KO/WT cells displayed a twofold higher uptake of BODIPY-SPM
than ATP13A2 KO and KO/D508N cells, both of which took up less
than control cells (Fig. 2c). The comparable uptake of the fluorescein
isothiocyanate (FITC)-dextran conjugate in the KO/WT and the
KO/D508N cells (Extended Data Fig. 4c) demonstrated that the higher
BODIPY-SPM uptake in KO/WT cells compared with KO/D508N cells
is not explained by an increased endocytic rate, but depends on the
transport activity of ATP13A2. The stimulatory effect of ATP13A2 on
endocytosis appears as a transport-independent phenotype.

The observation that ATP13A2 transports SPM towards the cytosol 
(Fig. 1j) and increases the cellular SPM content (Fig. 2a, b) suggests 
that it could transport endocytosed polyamines into the cytosol. 
Indeed, endocytosis inhibitors prevented the uptake of FITC-dextran 


The SPM dose-response curve from Fig. 1b is shown as a reference in f and j.
k, Representative autoradiogram depicting phosphoenzymes of wild-type
ATP13A2 compared with the disease-related mutants in the presence and
absence of 1 mM SPM. l, BODIPY-SPM uptake in cells containing the indicated
mutant proteins. MFI, mean fluorescence intensities; (-), vehicle-treated
sample. Data are presented as individual data points (representing replicates)
overlaid on box and whisker plots (a, b, horizontal line, median; box
boundaries, 25th and 75th percentiles) or mean (c, h, l) or mean ± s.e.m. (f, j).
The number of independent biological experiments was as follows: n = 3 d-f,
g (D508N, A467V, D962N, K1062A (SPM)), h, j (T12M, T512I, G872R), k (T12M (-),
T12M, T512I, G528R, A741T, G872R (SPM)), l; n = 4 a-c (KO, KO/WT, KO/DN),
g (D508N, A467V, D962N, K1062A (-)), i, j (A741T, G528R), k (wild-type, T512I,
G528R, A741T, G872R (-)); n = 5 c (control), g (wild-type (-)); n = 6 g (wild-type
(SPM)), k (wild-type (SPM)). Analysis was performed using one-way ANOVA with
Tukey’s (a-c) or Dunnett’s (h-l) corrections. Fitted lines indicate nonlinear
allosteric sigmoidal (f (D962N, D508N), j) or one-phase (f (A467V, K1062A))
association. For gel source data see Supplementary Fig. 1.


(Extended Data Fig. 4c) and blocked that of BODIPY-SPM (Fig. 2c). Using 
confocal microscopy, we confirmed that KO/WT cells had a higher 
BODIPY-SPM content than KO/D508N cells (Extended Data Fig. 1i).
In the KO/D508N cells, BODIPY-SPM mainly colocalized with
LAMP1-positive vesicles (Fig. 2d, Extended Data Fig. 1j, k); this is indicative
of accumulation in the late endolysosomes, which is not a lysosomotropic
effect. By contrast, cells with functional ATP13A2 displayed a
broader distribution of BODIPY-SPM, and it was more abundant in
the cytosol and nucleus; this is consistent with the transport direction
from lysosomal lumen to cytosol (Fig. 2d, Extended Data Fig. 1l). By
stimulating cellular SPM uptake and transporting SPM into the cytosol,
ATP13A2 complements endogenous SPM synthesis, which depends
on the enzymes ornithine decarboxylase (ODC) and the SPD and SPM
synthases. Consistent with this, KO/WT cells were protected against
pharmacological inhibition of ODC and the SPD and SPM synthases,
and a lack of ATP13A2 activity sensitized KO and KO/D508N cells to
inhibition (Extended Data Fig. 4d-g); this is consistent with the negative
genetic interactions between ATP13A2 orthologues and ODC in
yeast and Caenorhabditis elegans.
Fig. 3 | ATP13A2 protects against lysosome-dependent SPM toxicity.
a, SPM-induced cell death (PI, propidium iodide) in SH-SY5Y control, ATP13A2
knockout (KO) and rescue cell lines with wild-type ATP13A2 (KO/WT) or
ATP13A2(D508N) (KO/D508N). b-f, Effect of SPM (10 µM, 4 h) and acidic
nanoparticles (NP, 100-nm diameter; Extended Data Fig. 8j) on lysosomal
functionality. b, Lysosomal pH measured by ratiometric FITC-dextran
(standard curve, Extended Data Fig. 8i). c, d, Lysosomal degradation capacity
analysed by DQ-BSA (c) or cathepsin B (CTSB) (d) activity. e, f, Assessment of
lysosomal membrane integrity (LMI) via acridine orange (AO) staining (e), or
galectin-3 (Gal-3) punctae formation (lysosomal rupture) (f). In f, the confocal
images depict representative images (DAPI staining for nuclei reference; scale


Mutations disrupt SPM-induced ATP13A2 activity 


We next used mutagenesis techniques to confirm that the SPM-dependent
activation of ATP13A2 depends on residues in the predicted substrate-binding
site near transmembrane segment M4. We introduced mutations
in transmembrane segments M4 (A467V), M6 (D962N) and M8 (K1062A)
(Fig. 2e, Extended Data Fig. 5a, b; see Methods for the rationale in choosing
these residues). In comparison with the wild type, the three mutants
displayed a lower SPM-induced ATPase activity and apparent affinity for
SPM (Fig. 2f), as well as a reduced cellular uptake of BODIPY-SPD and
BODIPY-SPM (Extended Data Fig. 5c, Fig. 2h), which suggests that these
residues contribute to SPM coordination in the membrane region. SPM-induced
dephosphorylation was completely abolished in the D962N
mutant (Fig. 2g, Extended Data Figs. 5d, 6), indicating that D962 may 
couple SPM binding to the dephosphorylation reaction. 

More than thirty disease-associated mutations have been identified 
in ATP13A2 (Extended Data Fig. 7). We determined the activity of mainly 
lysosomal-localized mutants of ATP13A2 arising from point mutations 
in ATP13A2 that are linked to early-onset Parkinson’s disease or
Kufor-Rakeb syndrome. The T512I and G872R mutants, which are associated with
Kufor-Rakeb syndrome, did not exhibit ATPase or autophosphorylation
bars, 2.5 µm). The box and whisker plots depict the size (left) and number
(right) of punctae. g, h, The effect of nanoparticles (g) or the cathepsin B
inhibitor CA-074 (25 µM) (h) on SPM-induced cytotoxicity. Data are presented
as individual data points (representing replicates) overlaid on box and whisker
plots (b-f, horizontal line, median; box boundaries, 25th and 75th percentiles)
or means (g, h) or mean ± s.e.m. (a). The number of individual biological
experiments was as follows: n = 3 a, b-d (NP, NP + SPM), f-h; n = 4 e
(NP, NP + SPM); n = 6 b-d ((-), SPM); n = 7 e ((-), SPM). Analysis was performed
using two-way ANOVA with Dunnett’s test (a) or one-way ANOVA with Tukey’s
(b-e, g, h) or Sidak’s (f) test. Fitted lines indicate nonlinear log(inhibitor)
versus response (variable slope) (a).


activity and were SPM-insensitive, which is in line with their strongly
reduced BODIPY-SPD and BODIPY-SPM uptake (Fig. 2i-l, Extended
Data Fig. 5e-g). The T12M, G528R and A741T mutants, which are associated with
early-onset Parkinson’s disease, had a less severe effect. Compared with
wild type, a reduction in the apparent SPM affinity was observed and
phosphoenzyme levels were more (T12M) or less (G528R and A741T)
sensitive to SPM. In cells, the uptake of BODIPY-SPD and BODIPY-SPM
was less severely impaired with the mutations associated with
early-onset Parkinson’s disease than those associated with Kufor-Rakeb
syndrome (Fig. 2i-l, Extended Data Fig. 5e-g). In summary,
ATP13A2-dependent polyamine transport is disturbed in all mutants tested,
and the degree of functional impact correlates with the phenotypic
differences between early-onset Parkinson’s disease and Kufor-Rakeb
syndrome; however, the mutation type is not the sole determinant of
the clinical phenotype.


ATP13A2 protects against polyamine toxicity 


We further investigated whether defective lysosomal polyamine export 
explains the lysosomal phenotype in ATP13A2 KO cells. SPM and SPD
are abundant organic polycations that support cell function, but at 
Fig. 4 | Loss of ATP13A2 orthologues exacerbates the toxicity of polyamines
in primary neurons and in C. elegans. a, Lentiviral knockdown of Atp13a2 in
isolated mouse cortical neurons (miR-3 or miR-5) was confirmed by
quantitative PCR relative to Gapdh. miRNA against firefly luciferase (miR-Fluc)
was used as a negative control. b, miR-Fluc and Atp13a2 knockdown neurons
were transduced with Fluc, wild-type human ATP13A2 or ATP13A2(D508N).
SPM-induced cytotoxicity was assayed via TUNEL staining. Left, representative
confocal images depicting TUNEL-positive cells; right, box and whisker plots
showing quantification of the TUNEL staining. c, The indicated worm strains
were assessed for SPD toxicity. Worm lengths were determined as a read-out


high concentrations they become toxic. Consistent with this, high
levels of SPM or SPD reduced the viability of control cells after 24 hours 
(Extended Data Fig. 8a, b), which was paralleled by an increase in cell 
death (Fig. 3a); ornithine and putrescine were not cytotoxic (Extended 
Data Fig. 8c, d). Notably, loss of ATP13A2 activity exacerbated the tox- 
icity of SPM and SPD (Fig. 3a, Extended Data Fig. 8a, b), which may be 
a direct consequence of lysosomal polyamine accumulation (Fig. 2d) 
leading to lysosomal dysfunction. At a time point preceding cell death
(4 hours; Extended Data Fig. 8e), lysosomal acidification was compromised
in KO and KO/D508N cells, which was aggravated upon SPM
exposure (Fig. 3b). This pH-neutralizing effect of SPM was absent in 
control and KO/WT cells (Fig. 3b). 

Lysosomal alkalization may explain the decreased lysosomal- 
degradation potential (Fig. 3c) and cathepsin D activity (Extended 
Data Fig. 8f) that is observed in KO and KO/D508N cells. However, the 
activity of cathepsin B increased at toxic SPM levels (Fig. 3d), in line 
with its higher pH optimum and most likely due to impaired lysosomal 
membrane integrity, which is a driver of lysosome-dependent cell
death. We then confirmed, using an acridine orange-based assay,
that lysosomal membrane integrity in ATP13A2 KO and KO/D508N 
cells was impaired; this was more prominent after SPM challenge, 
for toxicity. Scale bar, 100 µm. Top, quantification; bottom, representative
images. d, Illustration of the proposed mechanism of endolysosomal
polyamine uptake and transfer into the cytosol via ATP13A2. (-),
vehicle-treated sample; EE, early endosome; LE, late endosome; LYS, lysosome. Data
are presented as mean (a) or as box and whisker plots (b, c, horizontal
line, median; box boundaries, 25th and 75th percentiles) with individual data
points representing replicates. The number of independent biological
replicates was as follows: n = 2 (a); n = 3 (b, c). Analysis was performed using
one-way ANOVA with Tukey’s test (b, c).


a phenotype that was absent in control and KO/WT cells (Fig. 3e). SPM
treatment also increased the number and size of endogenous galectin-3
punctae, representative of lysosomal rupture, in KO and KO/D508N
cells only (Fig. 3f). Lysosomal rupture was further confirmed by the loss
of endolysosomal FITC-dextran punctae and a more diffuse, cytosolic
cathepsin B staining in the KO and KO/D508N cells (Extended Data
Fig. 8g, h). In line with reported findings, exposure to acidic nanoparticles
restored lysosomal pH and functionality (Fig. 3b, c), prevented
cathepsin B activation (Fig. 3d) and recovered the intactness of lysosomal
membranes (Fig. 3e), ultimately reducing SPM-induced cell death
(Fig. 3g). Moreover, pharmacological inhibition of cathepsin B significantly
reduced SPM toxicity in the KO and KO/D508N cells (Fig. 3h).


ATP13A2 is protective in higher disease models


Lysosomal polyamine toxicity may be relevant in the context of neuro- 
degeneration, because isolated mouse cortical neurons with miRNA- 
mediated Atp13a2 knockdown (Fig. 4a) were more susceptible to 
SPM-induced cell death than control neurons (Fig. 4b). Notably, the 
increased sensitivity to SPM observed for neurons in which Atp13a2 
was knocked down was attenuated either by inhibition of cathepsin B 
(Extended Data Fig. 9) or by rescue with wild-type human ATP13A2, but
not by the D508N mutant (Fig. 4b).

Finally, SPD exposure was found to hamper growth capacity in 
C. elegans N2 (wild-type); this was marked by developmental delay and 
hence reduced worm length. These effects were exacerbated in worms 
deficient in ATP13A2 orthologues (catp-5(0), catp-6(0) or catp-7(0))
(Fig. 4c). The phenotype of CATP-6- and CATP-7-deficient strains was
rescued by wild-type CATP-6 or CATP-7, but not by a catalytically dead
variant, showing that the transport activity is required (Fig. 4c). 


Discussion 


Polyamines are physiologically important polycations that are tightly 
regulated by a complex interplay of import, export, synthesis and 
degradation. As a polyamine transporter that controls the cellular
polyamine content, ATP13A2 emerges as a member of the mammalian
polyamine transport system. Extracellular polyamines most probably
bind to plasma membrane heparan sulfate proteoglycans and
enter the cell through endocytosis before transport into the cytosol
by ATP13A2 (Fig. 4d). In addition, the other P5B-ATPases (ATP13A3,
ATP13A4 and ATP13A5) that reside in the endosomal system share high
sequence similarity in the substrate-binding region, and may belong
to the mammalian polyamine transport system. 

Genetic insights suggest a major role for lysosomal dysfunction in 
Parkinson’s disease, contributing to α-synuclein aggregation and
mitochondrial dysfunction. Here we demonstrate that impaired lysosomal
polyamine export represents a lysosome-dependent cell death pathway
that may be implicated in ATP13A2-associated neurodegeneration. In
addition, defective ATP13A2 leads to a reduction in cellular polyamine
content, which may potentiate the disease phenotype because polyamines
are scavengers of heavy metals and reactive oxygen species and
regulate autophagy. The dual effect of ATP13A2 on both lysosomal
and cytosolic polyamine levels may explain the broad phenotype that
is associated with its loss of function. Other genes that are related to
Parkinson’s disease may also affect ATP13A2 functionality, or may be
affected by disrupted polyamine homeostasis. Polyamine levels decline
with age, whereas polyamine supplementation increases lifespan in
several model organisms. Conversely, defective SPM synthase causes
Snyder-Robinson syndrome (a form of X-linked intellectual disability),
and reduced expression of SPD/SPM N-acetyltransferase 1 has
been implicated in Parkinson’s disease. Modulation of polyamine
homeostasis may therefore be considered for neuroprotective therapy. 

In conclusion, ATP13A2 dysfunction prevents late endolysosomal 
polyamine export and sensitizes cells to lysosomal disruption by exog- 
enous polyamines. 
Methods 


Data reporting 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized and the investigators were not 
blinded to allocation during experiments and outcome assessment. 


Materials 

The following reagents were purchased from Sigma-Aldrich: sodium 
orthovanadate (S6508), CaCl₂ (C3881), ZnCl₂ (Z0152), MnCl₂ (M3634), 
FeCl₃ (157740), SPM (S3256), SPD (S2626), N¹-acetylspermine 
trihydrochloride (01467), N¹-acetylspermidine hydrochloride (9001535-1), 
N⁸-acetylspermidine dihydrochloride (A3658), putrescine 
dihydrochloride (P7505), L-arginine (A5006), L-ornithine monohydrochloride 
(02375), histamine (H7250), agmatine sulfate salt (A7127), dopamine 
hydrochloride (H8502), cadaverine (D22606), yeast nitrogen base 
without amino acids (Y0626), yeast drop-out mix without uracil (Y1501), 
glucose (G8720), streptavidin sepharose (GE17-5113-01), thrombin 
(GE27-0846-01), DMSO (276855), DL-α-difluoromethylornithine (DFMO; 
D193), Dynasore (D7693), Pitstop 2 (SML1169), 4-methylumbelliferyl 
heptanoate (MUH, M2514), propidium iodide (P4170), CA-074 (C5732), 
fluorescein isothiocyanate-dextran (FITC-dextran, 46945), DAPI 
(D9542), anti-ATP13A2 antibody (A3361), anti-GAPDH antibody (G8795), 
Resomer RG 503H (719870) and SigmaFast protease inhibitor (S8820). 
In addition, 18:1 PtdIns(3,5)P₂ (1,2-dioleoyl-sn-glycero-3-phospho-(1’-myo-inositol-3’,5’-bisphosphate) 
(ammonium salt); 850154), 18:1 
phosphatidic acid (1,2-dioleoyl-sn-glycero-3-phosphate (sodium salt); 
840875) and egg phosphatidylcholine (840051) were obtained from 
Avanti Polar Lipids. Bovine serum albumin (BSA; 3854.3) was obtained 
from C. Roth. Yeast extract (103753.0500) was purchased from VWR, 
and n-dodecyl-β-D-maltopyranoside (DDM; 1758-1350) was purchased 
from Inalco. We obtained Bio-Beads SM-2 resin (1523920) from Bio-Rad. 
³H-SPM (ART 0471) was ordered from ARC. APCHA (N-(3-aminopropyl) 
cyclohexylamine; sc-202715) and 4MCHA (cis-4-methylcyclohexylamine; 
sc-272662) were purchased from Santa Cruz Biotechnology. 
Genistein (ab120112), anti-galectin-3 antibody (ab2785), anti-LAMP1 
antibody (ab24170) and anti-cathepsin B antibody (ab58802) were 
purchased from Abcam. TrypLE (12604021), AO (A1372) and DQ-Green 
BSA (D12050) were ordered from Life Technologies. 

HEK-293T cells were purchased from ATCC and certified by ATCC 
via STR genotype analysis. SH-SY5Y cells were purchased from ATCC 
and certified by ATCC via STR genotype analysis. SH-SY5Y cells from 
an in-house collection were authenticated via DNA fingerprinting 
(Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen 
und Zellkulturen GmbH). 


Preparation of compounds and inhibitors 

All polyamines, diamines, monoamines and amino acids were pre- 
pared to a final stock concentration of 500 mM (200 mM in the case 
of SPM) in 0.1 M MOPS-KOH (pH 7.0). DFMO was prepared to a final 
stock concentration of 500 mM in Milli-Q H₂O. The inhibitors 4MCHA 
and APCHA were dissolved in DMSO to a final stock concentration of 
200 mM. The endocytosis inhibitors Dynasore, genistein and Pitstop 2 
were dissolved in DMSO to final concentrations of 50 mM, 25 mM and 
25 mM, respectively. The cathepsin B inhibitor CA-074 was dissolved 
in DMSO to a final concentration of 25 mM. 
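
As an illustration of how working concentrations follow from these stocks, the dilution relation C1 x V1 = C2 x V2 can be sketched in Python. The helper name and the 1-ml working volume are assumptions chosen for the example; they are not part of the protocol.

```python
def stock_volume_ul(stock_mM: float, final_uM: float, final_volume_ul: float) -> float:
    """Return the stock volume (in µl) needed to reach final_uM in final_volume_ul.

    Applies C1 * V1 = C2 * V2 after converting the stock from mM to µM.
    """
    if stock_mM <= 0 or final_uM < 0 or final_volume_ul <= 0:
        raise ValueError("concentrations and volumes must be positive")
    return final_uM * final_volume_ul / (stock_mM * 1000.0)


# Hypothetical example: 100 µM Dynasore in 1 ml of medium from the 50 mM stock.
dynasore_ul = stock_volume_ul(50, 100, 1000)

# Hypothetical example: 10 µM SPM in 1 ml of medium from the 200 mM stock.
spm_ul = stock_volume_ul(200, 10, 1000)
```

A sub-microlitre result, as in the SPM example, suggests that an intermediate dilution would be pipetted in practice.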


Generation of SH-SY5Y cell models 

SH-SY5Y human neuroblastoma cells were transduced with lentiviral 
vectors to obtain stable overexpression of firefly luciferase (Fluc) or 
human ATP13A2 (isoform 2, wild-type (ID: NP_001135445), indicated 
disease or catalytic mutants) and maintained as described previously. 
The catalytic mutants A467V on M4, D962N on M6 and K1062A on M8 
were generated by mutagenesis. P5-type ATPases were discovered by 
genome sequence analysis 20 years ago and contain highly conserved 
motifs for function and substrate binding. The A467V mutation 
converts PPALP of the predicted substrate-binding site in transmembrane 
segment M4 into PPVLP that is present in ATP13A5. Also, neighbouring 
membrane helices contribute to substrate coordination in P-type 
ATPases, which often relies on conserved and charged residues, such 
as D962 in M6 and K1062 in M8 of ATP13A2. Furthermore, mutants 
associated with Kufor-Rakeb syndrome (T512I and G872R) or 
early-onset Parkinson’s disease (T12M, G528R and A741T) were 
generated. All cell lines were produced at varying viral vector titres 
and assessed for equal expression to wild-type ATP13A2. 

For CRISPR-Cas9-mediated knockout of ATP13A2, the lentiviral vector 
lentiCRISPRv2 (Addgene, 52961) was used. First, the Cas9 cassette 
was converted into a high-fidelity Cas9 by Gibson assembly 
with a gBlock gene fragment (Integrated DNA Technologies) of the 
Cas9 portion encoding a protein product with high-fidelity mutations 
(N497A/R661A/Q695A/Q926A). This Cas9 variant triggers fewer 
off-target events while retaining its on-target activity. A single-guide 
RNA (sgRNA) targeted to Atp13a2 was designed taking into account 
a high on-target efficiency, using sgRNA Designer 
(https://portals.broadinstitute.org/gpp/public/analysis-tools/sgrna-design), and a 
low off-target efficiency, via CRISPR Design (http://crispr.mit.edu/). 
A lentiviral CRISPR-Cas9 high-fidelity expression plasmid was created 
by inserting fragments that contained an sgRNA sequence of 
ATP13A2 (forward, 5’-CACCGGTCAGGGTCCCATAACCGGT; reverse, 
5’-AAACACCGGTTATGGGACCCTGACC) into the lentiCRISPRv2 vector. 
The generated CRISPR-Cas9 high-fidelity ATP13A2 plasmid (1,000 
ng), the packaging plasmid pCMV-ΔR8.91 (900 ng) and the envelope 
plasmid pMD2G-VSV-G (Addgene, 12259) (100 ng) were mixed together 
with 200 µl of JetPrime buffer and 4 µl of JetPrime reagent 
(Polyplus-transfection) for transfection of HEK-293T cells according to the 
manufacturer’s protocol. After 4 h at 37 °C and 5% CO₂, the serum-free 
medium was replaced with DMEM/F12 (Dulbecco’s modified Eagle’s 
medium, Nutrient Mixture F-12) supplemented with 10% fetal calf serum 
(heat-inactivated). After 48 h, the lentiviral vectors were collected 
by passing the medium through a 0.45-µm filter, and 0.5 ml of this 
medium was used to transduce SH-SY5Y cells supplemented with 
8 µg ml⁻¹ polybrene (Sigma-Aldrich). At 24 h after transduction, cells were 
selected in 3 µg ml⁻¹ puromycin (Sigma-Aldrich) and passaged three 
times before single clones were isolated via serial dilution. The resulting 
cells were examined by quantitative PCR (qPCR) (ATP13A2 forward, 
5’-ACCGGTTATGGGACCCTGAC; ATP13A2 reverse, 
5’-GTGATAGCCGATGACCCTCC) with HPRT and TBP as internal controls (HPRT forward, 
5’-TGAGGATTTGGAAAGGGTGTTT; HPRT reverse, 
5’-ACATCTCGAGCAAGACGTTCAG; TBP forward, 5’-CGGCTGTTTAACTTCGCTTC; 
TBP reverse, 5’-CACACGCCAAGAAACAGTGA) and western blotting. 
For rescue experiments, ATP13A2 knockout cells were stably transduced 
with lentiviral vectors expressing wild-type ATP13A2 or the D508N 
mutant in which both cDNAs were modified with synonymous mutations 
at the sgRNA target site. All cell lines were routinely assessed for 
mycoplasma and cultured for a maximum of 20 passages. 
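
The relationship between the 20-nucleotide guide and the two cloning oligonucleotides listed above can be checked with a short Python sketch. It assumes the commonly used lentiCRISPRv2 overhang layout (forward = CACCG + guide; reverse = AAAC + reverse complement of the guide + C); the function names are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def cloning_oligos(guide: str) -> tuple[str, str]:
    """Build forward and reverse oligos under the assumed overhang layout."""
    forward = "CACCG" + guide
    reverse = "AAAC" + reverse_complement(guide) + "C"
    return forward, reverse


# 20-nt guide read from the forward oligo in the text (after the CACCG prefix).
GUIDE = "GTCAGGGTCCCATAACCGGT"
forward, reverse = cloning_oligos(GUIDE)
```

Under this assumption the sketch reproduces both oligonucleotide sequences given in the text, and the reverse complement of the guide equals the ATP13A2 forward qPCR primer.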


Membrane fractionation 

SH-SY5Y cells were seeded in 15-cm dishes at a density of 6 × 10⁶ cells per 
plate. The cells were collected 24 h later after trypsinization and brief 
centrifugation (300g, 5 min). Subcellular fractionation was performed 
by differential centrifugation, as described previously. The 
microsomal protein concentration was measured using the bicinchoninic 
acid assay (Thermo Fisher Scientific, Pierce) according to the manu- 
facturer’s instructions. 


ATPase assay 

The ATPase activity of ATP13A2 was assessed using a commercially 
available luminescence assay (ADP-Glo Max assay, Promega) that 
monitors the production of ADP via luciferase activity. The substrate 
screen was designed to include candidates previously postulated in 
the literature. The reactions were performed for 30 min (37 °C) 
in a final volume of 25 µl. The assay reaction mixture contained 50 mM 
MOPS-KOH (pH 7), 100 mM KCl, 11 mM MgCl₂, 1 mM DTT, 195 µM DDM, 
various concentrations of the indicated compound, and either 
microsomes (5 µg) collected from SH-SY5Y cells overexpressing ATP13A2 
(wild-type or mutants) or purified ATP13A2 (0.3-0.5 µg). When 
purified ATP13A2 was used, we included 125 µM phosphatidic acid, 125 µM 
PtdIns(3,5)P₂ and 19.5 µM DDM in the reaction buffer. The assay was 
started by the addition of 5 mM ATP and was terminated by adding 
25 µl of ADP-Glo Reagent. The 96-well plate was then incubated for 40 
min at room temperature, followed by the addition of 50 µl of ADP-Glo 
Max Detection Reagent. After 40 min, luminescence was detected 
using a FlexStation 3.0 system (Molecular Devices). Dose-response 
curves and n, Kₘ and Vₘₐₓ values were calculated using GraphPad Prism 
software (GraphPad Software). 
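
The parameters n, Kₘ and Vₘₐₓ reported above are those of a Hill-type dose-response relationship. The exact model selected in GraphPad Prism is not stated here, so the following Python sketch is only an assumed form, v = Vₘₐₓ · [S]ⁿ / (Kₘⁿ + [S]ⁿ), with an illustrative function name.

```python
def hill_rate(s: float, vmax: float, km: float, n: float) -> float:
    """Evaluate an assumed Hill-type rate equation.

    s    : substrate concentration (same unit as km)
    vmax : maximal rate
    km   : concentration giving half-maximal rate
    n    : Hill coefficient
    """
    if s < 0 or km <= 0 or n <= 0:
        raise ValueError("s must be >= 0; km and n must be > 0")
    if s == 0:
        return 0.0
    return vmax * s**n / (km**n + s**n)
```

By construction, the rate at [S] = Kₘ equals half of Vₘₐₓ for any n, which is a quick sanity check on fitted values.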


Autophosphorylation assay 

The autophosphorylation activity of ATP13A2 on the conserved D508 
residue was measured as described previously. In brief, microsomes 
(20 µg) or purified ATP13A2 (1 µg) were incubated with radioactively 
labelled ATP in the presence of the indicated SPM or ornithine 
concentrations, and after 1 min the reaction was stopped. In the case of 
purified ATP13A2, 125 µM phosphatidic acid and 125 µM PtdIns(3,5)P₂ 
were included in the reaction mixture. To determine the sensitivity 
of the ATP13A2 phosphoenzyme to ATP or a combination of ATP 
and SPM, 30 s after adding ³²P-ATP samples were incubated with 
non-radioactive ATP (5 mM in experiments using microsomes, 1 mM for 
purified ATP13A2) and SPM (1 mM) before the reaction was stopped 
at the indicated time points. The incorporation of ³²P was visualized 
after SDS-PAGE under acidic conditions and subsequently detected 
by autoradiography (microsomes) or liquid scintillation counting 
(purified ATP13A2) (Liquid Scintillation Analyzer TRI-CARB 2900TR). 


Transformation and overexpression of ATP13A2 in yeast 

The S. cerevisiae W303-1B/Gal4-ΔPep4 strain (leu2-3, his3-11,15, 
trp1-1::TRP1-GAL10-GAL4, ura3-1, ade2-1, canr, cir+, ΔPep4 MAT; a gift from 
R. Lopez Marques) was transformed according to the lithium acetate/ 
single-stranded carrier DNA/polyethylene glycol method, with the 
pYeDP60 vector containing a yeast codon-optimized version of human 
ATP13A2 variant 2 (wild-type or the catalytically dead E343A mutant) 
followed by a thrombin cleavage site and a C-terminal BAD tag. 
The transformation mixture was grown for 48 h at 30 °C on minimal 
medium agar plates lacking uracil (0.54% yeast nitrogen base without 
amino acids, 0.12% yeast drop-out mix without uracil, 2% glucose 
and 2% agar) to select yeast colonies that acquired the plasmid. These 
colonies were then cultured in 20 ml of MM-Ura medium (0.67% yeast 
nitrogen base without amino acids, 0.19% yeast drop-out mix without 
uracil, and 2% glucose) and grown for 24 h at 28 °C and 200 rpm. The 
MM-Ura yeast pre-culture was used to inoculate 100 ml of MM-Ura 
medium to a final OD₆₀₀ of 0.2, followed by a 12-h incubation period 
(28 °C and 200 rpm). The second pre-culture was inoculated into 4.5 l 
of YPGE2X medium (2% yeast extract, 2% bactopeptone, 1% glucose 
and 2.7% ethanol) to a final OD₆₀₀ of 0.05 and grown for 36 h (28 °C and 
175 rpm). ATP13A2-BAD expression was induced with 2% galactose, 
followed by a second galactose induction 12 h later. After another 12 h, 
the pellet was collected (1,000g, 10 min, 4 °C). 


Yeast membrane preparation 

Yeast cells were broken with glass beads using a BeadBeater (BioSpec 
products). The lysis buffer contained 50 mM Tris-HCl (pH 7.5), 1 mM 
EDTA, 0.6 M sorbitol, 1 mM phenylmethylsulfonyl fluoride and 
SigmaFast protease inhibitor. To remove cell debris and nuclei, the crude 
extract was centrifuged at 2,000g for 20 min (4 °C). The supernatant 
(S1) was centrifuged at 20,000g for 20 min (4 °C) to pellet the heavy 
membrane fraction (P2), and the resulting supernatant (S2) was further 
centrifuged at 200,000g for 1 h (4 °C). The resulting pellet (that is, the 
light membrane fraction, P3) was resuspended in 20 mM HEPES-Tris 
(pH 7.4), 0.3 M sucrose and 0.1 mM CaCl₂. The total protein concentration 
was determined using a Bradford assay (B6916, Sigma-Aldrich). 


Purification of ATP13A2-BAD 

The method used to purify ATP13A2 was based on the purification of 
overexpressed Drs2p from yeast membranes. Yeast P3 membranes 
were diluted to 5 mg of total protein per ml in SSR buffer (50 mM 
MOPS-KOH (pH 7), 100 mM KCl, 20% glycerol, 5 mM MgCl₂, 1 mM DTT, and 
SigmaFast protease inhibitor cocktail) and solubilized using DDM, 
with a detergent-to-protein ratio of 1:1. The samples were stirred on ice 
for 30 min, followed by centrifugation (100,000g, 1 h, 4 °C) to pellet 
non-solubilized membranes. The solubilized material was incubated 
with streptavidin beads for 4 h (4 °C) to enable binding of BAD-tagged 
ATP13A2 to the resin. To eliminate unbound material, the resin was 
washed four times with three resin volumes of SSR buffer supplemented 
with 0.5 mg ml⁻¹ DDM. Subsequent cleavage by thrombin (0.0625 U per 
mg total protein) enabled the release of ATP13A2 from the beads by 
overnight incubation at 4 °C. Finally, a Vivaspin Turbo 4 concentrator 
(100 kDa molecular weight cut-off, Sartorius) was used to concentrate 
the sample. The protein concentration was determined using a Bradford 
assay. The quality of the purification was evaluated via SDS-PAGE 
followed by Coomassie staining or immunoblotting, as described 
previously. Furthermore, the purified ATP13A2 sample was analysed by 
linear mode MALDI-TOF MS (matrix-assisted laser desorption 
time-of-flight mass spectrometry; Applied Biosystems 4800 Proteomics 
Analyzer) in the presence of α-cyano-4-hydroxycinnamic acid as matrix 
and after C4 OMIX (Agilent) pipette tip purification. 


Reconstitution of yeast membranes 

To reconstitute yeast membranes, we followed a strategy similar to 
that described before. P3 membranes from the yeast membrane 
preparation expressing the ATP13A2-BAD construct were solubilized in buffer T 
(10 mM Tris-HCl (pH 7.4) and 1 mM EDTA) supplemented with 1.4% DDM. 
After removing the insoluble fraction by ultracentrifugation (30 min, 
200,000g), the detergent extract was supplemented with 4.5 mM egg 
phosphatidylcholine and 0.5 mM 18:1 phosphatidic acid (in buffer T 
containing 0.7% DDM). The extract was then treated with Bio-Beads to 
remove the DDM and reconstitute proteoliposomes (that is, ‘no ATP 
inside’ condition). To generate proteoliposomes that contained 
intraluminal ATP (that is, ‘ATP inside’ condition), we added 5 mM ATP and an 
ATP-regenerating system before incubation with the Bio-Beads. Finally, 
the vesicles were recovered by ultracentrifugation (1 h, 200,000g) and 
resuspended in buffer T. The protein concentration was determined 
using a Bradford assay. 


Transport assay using reconstituted vesicles 

Uptake of ³H-SPM into freshly prepared vesicles was measured within 60 
min. The above-described vesicles (‘no ATP inside’ or ‘ATP inside’) were 
diluted to 1 µg µl⁻¹ in buffer T. The reactions were performed for 10 min 
(30 °C) in a final volume of 1 ml. The assay reaction mixture contained 
50 mM MOPS, 100 mM KCl, 11 mM MgCl₂, 1 mM DTT and reconstituted 
vesicles (45 µg). The reaction was started by adding 1 mM ³H-SPM. For 
the condition ‘no ATP inside’, 5 mM ATP and an ATP-regenerating system 
were added after Bio-Bead treatment, before the addition of ³H-SPM. 
The reaction was stopped by filtering the samples through Millipore 
filters (0.45 µm). After washing of the filters with assay buffer, 
radioactivity retained on the filters was counted using a liquid scintillation 
counter (Liquid Scintillation Analyzer TRI-CARB 2900TR). 


Cellular transport assay and endocytosis assessment 

BODIPY-SPD and BODIPY-SPM were synthesized as described 
previously (compounds 14 and 15, respectively). The cells were seeded in 
12-well plates (1.0 × 10⁵ cells per well) and the next day the cells were 
incubated with 5 µM BODIPY-SPM or BODIPY-SPD for 2 h before 
collection. To assess endocytosis, the cells were pre-treated (30 min) with 
the endocytosis inhibitors Dynasore (100 µM), genistein (50 µM) and/or 
Pitstop 2 (50 µM) before the addition of either 20 µg ml⁻¹ FITC-dextran 
or BODIPY-SPM (2 h at 37 °C). The cells were then collected (300g, 
5 min), washed and resuspended in PBS containing 1% BSA. Finally, 
an Attune NxT (Thermo Fisher Scientific) flow cytometer was used to 
record the mean fluorescence intensities (MFI) of 10,000 events per 
treatment. 


Metabolomics 
Cells were grown in a 6-well plate and extracted as described 
previously. In brief, the medium was removed and cells were washed with 
a 0.9% NaCl solution. The washing solution was removed and 150 µl of 
6% trichloroacetic acid (Sigma) was added for the extraction. Using 
a cell scraper, the full extract was transferred into an Eppendorf tube and 
incubated for 30 min on ice. Insoluble material, such as precipitated 
proteins, was removed by centrifugation for 20 min at 20,000g at 4 °C. 
To 100 µl of the supernatant, 900 µl of a 100 mM sodium carbonate 
buffer (pH 9.0) was added. Next, 25 µl of isobutyl chloroformate (Sigma) 
was added and the mixture was incubated for 30 min at 35 °C. Then, 800 µl 
of the reaction mixture was transferred to a 2 ml Eppendorf tube and 
1 ml of diethylether (Sigma) was added. The mixture was vortexed 
vigorously and left for 15 min at 25 °C. Next, 900 µl of the upper phase was 
transferred into an Eppendorf tube and dried using a vacuum centrifuge. 
Finally, the dried extract was dissolved in 125 µl of a 50% acetonitrile 
(LC-MS grade, Merck) solution in water containing 0.2% acetic acid. 
Then, 15 µl of the extract was loaded onto a Thermo Scientific Liquid 
Chromatography QQQ (Quantiva, Thermo Fisher Scientific) equipped 
with an ACQUITY UPLC BEH C18 (1.7 µm, 2.1 × 100 mm) column from 
Waters. Solvent A consisted of ultrapure H₂O with 0.2% acetic acid 
while solvent B was acetonitrile (Merck) with 0.2% acetic acid; all 
solvents used were LC-MS grade. Flow rate remained constant at 
250 µl min⁻¹, and the column temperature remained constant at 30 °C. 
A gradient for the separation of modified polyamines was applied 
as follows: from 0 to 2 min 20% B, from 2 to 10 min a linear increase 
to 85% B was carried out and 85% B was maintained until 17 min. At 
18 min the gradient returned to 20% B. The method stopped at 22 min. 
The mass spectrometer operated in positive ion mode (3,500 V); the 
source settings were as follows: sheath gas at 50, aux gas at 10, the 
ion transfer tube was heated at 325 °C and the vaporizer temperature 
was set at 350 °C. The mass spectrometer operated in multiple 
reaction monitoring mode and used the following transitions: putrescine 
(parent m/z at 289.2 > fragment 215.2, collision energy at 10.25 V), SPD 
(parent m/z at 446.4 > fragment 198.2, collision energy at 23.6 V) and 
SPM (parent m/z at 603.4 > fragment 455.1, collision energy at 19.91 V). 
Peak area was integrated using the XCalibur Quan tool (version 
4.2.28.14, Thermo Fisher Scientific). 
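
The solvent gradient described above can be written as a piecewise-linear function of time. The Python sketch below encodes the stated breakpoints; the linear ramp between 17 and 18 min and the function name are assumptions, because the text gives only the time points.

```python
# (time in min, % solvent B) breakpoints taken from the gradient description.
GRADIENT = [
    (0.0, 20.0),   # start at 20% B
    (2.0, 20.0),   # hold until 2 min
    (10.0, 85.0),  # linear increase to 85% B
    (17.0, 85.0),  # hold until 17 min
    (18.0, 20.0),  # return to 20% B (ramp shape assumed linear)
    (22.0, 20.0),  # re-equilibrate until the method stops
]


def percent_b(t_min: float) -> float:
    """Return % solvent B at time t_min by linear interpolation."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-22 min method window")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole window")
```

For example, `percent_b(6.0)` falls halfway along the 2-10 min ramp.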


Preparation and characterization of acidic nanoparticles 

Acidic nanoparticles were prepared as previously described. In brief, 
31 mg of Resomer RG 503H (lactide to glycolide ratio 50:50, molecular 
weight 24-38 kDa) was dissolved in 3.1 ml of tetrahydrofuran and 
subsequently 200 µl of this solution was added to 20 ml of ultrapure water 
under sonication. The suspension was then concentrated using the 
rotary evaporator to a final volume of approximately 12 ml, resulting 
in a concentration of 0.167 mg ml⁻¹. The size distribution of prepared 
nanoparticles was measured using a Wyatt DynaPro DLS plate reader 
(Wyatt), using an 830-nm laser in a flat-bottom 384-well plate (Greiner) 
at 25 °C and 10 measurements were averaged per experiment. 
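
The stated nanoparticle concentration follows from the quantities above, as the short Python calculation below illustrates. The function name is illustrative, and the dilution factor assumes the 180 ng ml⁻¹ working concentration used in the cell assays described later.

```python
def nanoparticle_conc_mg_per_ml(
    polymer_mg: float, solvent_ml: float, aliquot_ul: float, final_volume_ml: float
) -> float:
    """Concentration after adding an aliquot of polymer stock and concentrating."""
    stock_mg_per_ml = polymer_mg / solvent_ml           # 31 mg / 3.1 ml = 10 mg/ml
    aliquot_mg = stock_mg_per_ml * aliquot_ul / 1000.0  # 200 µl contains 2 mg
    return aliquot_mg / final_volume_ml                 # 2 mg in about 12 ml


conc = nanoparticle_conc_mg_per_ml(31, 3.1, 200, 12)

# Fold dilution needed to reach 180 ng/ml (1 mg/ml = 1e6 ng/ml).
dilution_factor = conc * 1e6 / 180
```

The result is about 0.167 mg ml⁻¹, matching the value in the text, and corresponds to a dilution of roughly 900-fold for the working concentration.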


MUH cytotoxicity assay 

SH-SY5Y cells were seeded in 96-well plates (1 × 10⁴ cells per well) and 
allowed to adhere overnight. The cells were subsequently treated 
with increasing doses of the indicated compounds for 24-48 h. After 
exposure, the cells were washed with PBS and stained with 300 µg ml⁻¹ 
MUH (prepared in DMSO, dissolved in PBS) for 30 min at 37 °C. 
Cytotoxicity was read using a FlexStation 3.0 multi-well plate reader (Molecular 
Devices; excitation 360 nm, emission 460 nm, cut-off 455 nm). Data 
were expressed relative to control. 


Propidium iodide exclusion assay 

Cells were seeded in 12-well plates (1.0 × 10⁵ cells per well) and the next 
day the cells were treated with increasing doses of SPM, alone or in 
combination with the cathepsin B inhibitor CA-074 (25 µM, 1 h 
pre-incubation) or acidic nanoparticles (180 ng ml⁻¹, 1 h pre-incubation), 
and incubated for 24 h at 37 °C. Thereafter, the cells were collected 
following trypsinization and a brief centrifugation (300g; 5 min), washed 
with PBS and stained with 1 µg ml⁻¹ propidium iodide (in PBS containing 
1% BSA). An Attune NxT (Thermo Fisher Scientific) flow cytometer was 
used to determine the proportion of propidium iodide-positive cells 
(10,000 events per treatment). 


FITC-dextran-based lysosomal pH 

The protocol was adapted from previously published methods. SH-SY5Y cells were seeded in 
12-well plates (1.0 × 10⁵ cells per well) and allowed to adhere overnight. 
Cells were exposed to 50 µg ml⁻¹ FITC-dextran for 72 h. Samples were then 
washed and placed in fresh medium for 2 h before treatment with SPM 
(10 µM) alone or in combination with acidic nanoparticles (180 ng ml⁻¹, 
1 h pre-incubation) for a further 4 h. Samples were then collected by 
centrifugation (300g, 5 min) and washed in PBS. Cells were finally 
resuspended in 500 µl of PBS containing 1% BSA and FITC dual emission was 
assessed by flow cytometry (excitation 488 nm, emission 530 nm (BL1) 
and 600 nm (BL2)) of 10,000 events per condition using an Attune NxT 
flow cytometer (Thermo Fisher Scientific). The emission ratio 
(BL1/BL2) of all samples was compared to a standard curve, whereby signals 
were obtained from untreated cells resuspended in monensin 
(100 µM) containing Britton-Robinson buffer with increasing pH (3.0-8.0). 
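
Converting a measured BL1/BL2 ratio into a pH value via the standard curve can be sketched as a linear interpolation between calibration points. In the Python example below, the calibration values are hypothetical placeholders, the ratio is assumed to increase monotonically with pH, and the function name is illustrative; the actual curve-fitting procedure is not specified in the text.

```python
from bisect import bisect_left


def ratio_to_ph(ratio: float, standards: list[tuple[float, float]]) -> float:
    """Interpolate pH from an emission ratio.

    standards: (ratio, pH) pairs from the calibration buffers, assumed monotonic.
    """
    points = sorted(standards)
    ratios = [r for r, _ in points]
    if not ratios[0] <= ratio <= ratios[-1]:
        raise ValueError("ratio outside the calibrated range")
    i = bisect_left(ratios, ratio)
    if ratios[i] == ratio:
        return points[i][1]
    (r0, ph0), (r1, ph1) = points[i - 1], points[i]
    return ph0 + (ph1 - ph0) * (ratio - r0) / (r1 - r0)


# Hypothetical calibration points for pH 3.0-8.0 (not measured values).
STANDARDS = [(0.5, 3.0), (0.8, 4.0), (1.4, 5.0), (2.3, 6.0), (3.1, 7.0), (3.5, 8.0)]
```

Ratios outside the calibrated range raise an error rather than being extrapolated, which avoids reporting pH values that the standard curve cannot support.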


Lysosomal degradative capacity 

Cell lines were seeded in 12-well plates (1.0 × 10⁵ cells per well) and the 
next day the cells were pre-treated with SPM (10 µM) for 1 h at 37 °C. 
For samples requiring exposure to acidic nanoparticles, cells were 
treated with 180 ng ml⁻¹ 1 h before the addition of SPM. Subsequently, 
5 µg ml⁻¹ DQ-Green BSA was added to the cells for a further 3 h (37 °C). 
Finally, the cells were collected (300g, 5 min), and the MFI of 10,000 
events was assessed using an Attune NxT (Thermo Fisher Scientific) 
flow cytometer. 


Cathepsin activity assays 

SH-SY5Y cells were seeded in 10-cm plates (2 × 10⁶ cells per plate) and 
allowed to adhere overnight before treatment with 10 µM SPM for 4 h 
at 37 °C. For samples requiring the addition of acidic nanoparticles, 
180 ng ml⁻¹ nanoparticles were added 1 h before the addition of SPM. 
Next, the samples were collected using TrypLE and a brief 
centrifugation (300g, 5 min). The activities of cathepsin B (ab65300) and 
cathepsin D (ab65302) were assessed using commercially available kits 
(Abcam) according to the manufacturer’s instructions. MFI values were 
acquired using a FlexStation 3.0 multi-well plate reader (Molecular 
Devices). 


Lysosomal membrane integrity 

SH-SY5Y cells were seeded in 12-well plates (1.0 × 10⁵ cells per well) and 
the next day the cells were incubated with 5 µg ml⁻¹ acridine orange 
(dissolved in medium) for 15 min at 37 °C. Thereafter, the medium 
was discarded, the cells were washed, and fresh medium was added. 
For samples requiring acidic nanoparticles, cells were treated with 
180 ng ml⁻¹ 1 h before the addition of SPM. The cells were then treated 
with 10 µM SPM for 4 h at 37 °C. Finally, the cells were collected and 
resuspended in PBS containing 1% BSA. The MFI of 10,000 events was 
captured using an Attune NxT (Thermo Fisher Scientific) flow 
cytometer. 


BODIPY-SPM localization and lysosomal rupture analysis 

For all immunofluorescent stainings, SH-SY5Y cell lines were seeded in 
12-well plates (0.25 × 10⁵ cells per well on coverslips) and the next day 
the cells were incubated with the indicated compounds for 4 h at 37 °C. 
For BODIPY-SPM analysis, cells were pulsed with 5 µM BODIPY-SPM 
for 15 min, washed and chased in fresh medium for a further 105 min 
at 37 °C. After treatment, the cells were washed twice in PBS, fixed in 
4% paraformaldehyde for 30 min (37 °C), washed in PBS and stored at 
4 °C. For immunofluorescence staining, the cells were washed in PBS 
containing 0.5% Tween 20 (PBS-T), permeabilized in PBS containing 
0.1% Triton X-100 (30 min) and blocked first in 0.1 M glycine (1 h) and 
then in PBS-T containing 1% fetal calf serum and 10% BSA (30 min). 
Gal-3-, LAMP1- and cathepsin B-specific antibodies were used at a dilution of 
1:100-200 (in PBS-T containing 1% BSA) overnight at 4 °C. Subsequently, 
the samples were washed and incubated with Alexa Fluor secondary 
antibody (1:1,000; 30 min). To assess FITC-dextran release, cells were 
loaded with 50 µg ml⁻¹ FITC-dextran for 72 h; cells were then washed 
and placed in fresh medium for 1 h before the addition of SPM (10 µM). 
To visualize the nucleus, all samples were stained with DAPI (200 ng ml⁻¹, 
15 min). After staining, the samples were fixed to slides, and images were 
acquired using an LSM780 or LSM880 confocal microscope (Zeiss). For 
the acquisition of BODIPY-SPM, images were taken with equal settings 
to confirm uptake potential of ATP13A2 (Extended Data Fig. 1i). To assess 
the intracellular distribution of BODIPY-SPM (Fig. 2d, Extended Data 
Fig. 1j-l), microscope settings were optimized per cell type to enable 
a comparable assessment of BODIPY-SPM localization in KO/WT and 
KO/D508N cells. 


Neuron isolation 

Primary cortical neurons were derived from FVB/N mouse embryos at 
embryonic day (E)16. Pregnant mice were euthanized on gestation 
day 16 by cervical dislocation. The brains of E16 mouse pups were col- 
lected and placed in a dish containing calcium- and magnesium-free 
Hanks’ Balanced Salt Solution (HBSS, Life Technologies, 14180-046) 
on ice. Both cerebral hemispheres were separated from the cerebel- 
lum. Meninges were removed from the cerebral hemispheres and the 
brain cortices dissected. Brain cortices were collected, washed twice 
and digested with 0.05% trypsin (Life Technologies, 25300-054, 10 min 
at 37 °C). The trypsin reaction was terminated by the addition of 7 ml 
HBSS and 1 ml of horse serum. Cells were separated by pipetting and 
filtration through a cell strainer (40 µm, Falcon, 352340). Cells were 
centrifuged at 1,000 rpm for 5 min (4 °C), the supernatant discarded 
and the pellet suspended in 5 ml Dulbecco’s modified Eagle’s medium 
(DMEM; Sigma-Aldrich, D6546) containing GlutaMAX (Life 
Technologies, 31966-021), 5% horse serum (Life Technologies, 26050-088) 
and 20 mM glucose (Sigma-Aldrich, 8270). Primary cortical neurons 
were plated in 12-well plates, on coverslips coated with poly-D-lysine 
(Sigma-Aldrich, P6407). After an overnight incubation, cell medium 
was exchanged for Neurobasal medium (Life Technologies, 21103-049) 
supplemented with 2 mM L-glutamine (Life Technologies, 25030-24) 
and 2% B27 (Life Technologies, 17504-044). 


Atp13a2 knockdown and rescue in isolated cortical neurons 

For knockdown, microRNA (miR)-based short-hairpin lentiviral 
vectors were generated as described. The two most potent miRs against 
mouse Atp13a2 (mouse miR-3, CCACGCCGAAACACTCGTTATA and 
mouse miR-5, CGCCGAAACACTCGTTATAGAA) were used to induce 
knockdown in mouse primary neurons. A miR targeting Fluc was used as 
a control (miR-Fluc, ACGCTGAGTACTTCGAAATGTC). Primary neurons 
were transduced 4 days after isolation, 72 h before experimentation. 
For rescue of ATP13A2 expression in knockdown conditions, neurons 
were subjected to a second round of transductions with either human 
wild-type or the D508N variant of ATP13A2, 24 h after the addition of the 
miR. Fluc was used as an overexpression control. At day 7 post-isolation, 
cells were treated with 10 µM SPM for 24 h. To test the contribution 
of cathepsin B to SPM-induced cell death, neurons were pre-treated 
(30 min) with 10 µM CA-074. Knockdown efficiency was validated by 
qPCR of mRNA levels 72 h after transduction using the following 
primers (Atp13a2 forward, CATGGCCCTCTACAGCCTGA; Atp13a2 reverse, 
CTCATGAGCACCGCAACCGT) with Gapdh as internal control (forward, 
TGTGTCCGTCGTGGATCTGA; reverse, CCTGCTTCACCACCTTCTTGA). 
All mouse primary neuron experiments were carried out in accordance 
with the European Communities Council Directive of November 24 
1986 (86/609/EEC) and approved by the Bioethical Committee of the 
KU Leuven (Belgium) (ECD project P185-2014). 
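
The qPCR-based validation of knockdown efficiency described above requires a relative quantification step. The text does not state which method was used; the Python sketch below shows one common option, the 2^(-ΔΔCt) approach, under the assumption of roughly 100% amplification efficiency. The function name and the Ct values in the comment are hypothetical.

```python
def relative_expression(
    ct_target_sample: float,
    ct_ref_sample: float,
    ct_target_control: float,
    ct_ref_control: float,
) -> float:
    """Return target expression in the sample relative to the control.

    Normalizes the target Ct (for example Atp13a2) to a reference Ct
    (for example Gapdh) in each condition, then compares conditions.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_sample - delta_control)


# Hypothetical Ct values: a two-cycle shift corresponds to 25% residual mRNA.
residual = relative_expression(27.0, 18.0, 25.0, 18.0)
```

A value of 0.25 would indicate a 75% knockdown relative to the miR-Fluc control under these assumptions.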


TUNEL staining 

TUNEL staining was assessed according to the manufacturer’s protocol 
for the Click-iT Plus TUNEL assay (Thermo Fisher Scientific, C10617). 
DAPI was used as a nuclear counterstain and images were acquired 
using an LSM780 confocal microscope. 


Caenorhabditis elegans 

During routine culture, nematodes were grown on NGM (nematode 
growth medium) using Escherichia coli strain AMA1004 as food 
source. To reduce the likelihood that bacterial catabolism 
and divalent cations would interfere with SPD assays, these were 
done using DCDA (divalent-cation-depleted agar) plates, which do not 
permit bacterial growth. This medium contains 2% agar (Carl Roth 
5210) that has been washed with 50 mM EDTA, followed by multiple 
washes with reverse osmosis-purified water; 50 mM HEPES pH 7.4; 
and 20 µg ml⁻¹ kanamycin. SPD trihydrochloride (1 M stock solution; 
Sigma S2501) was added to a final concentration of 5 mM after 
microwaving to melt the agar. Assays were performed in 35-mm plates that 
contained 1 ml of DCDA. Approximately 2 µl of E. coli was transferred 
to each assay plate from the lawn of a seeded NGM plate. In each case, 
multiple adult hermaphrodites were added to the assay plate and 
allowed to lay eggs for 8 h (datasets 1 and 3) or 15 h (dataset 2) at 
23.5 °C, then removed. Incubation of plates at 23.5 °C was continued 
until the 0 mM SPD N2 plates contained many mid-late stage L4s, at 
which point worms were rinsed off the plates and stored in microfuge 
tubes at -20 °C. Worm lengths were determined by mounting the 
thawed (dead) animals on an agarose pad covered by a coverslip, 
capturing images using a Leica M205 FA microscope equipped with 
digital camera plus software, and measuring the length of each worm 
from snout to tail tip using ImageJ. 
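
Measuring a curved worm from snout to tail tip amounts to summing the segments of a traced line and applying the image calibration. The Python sketch below illustrates that arithmetic; the point coordinates and the µm-per-pixel scale are hypothetical, and the function is an illustration of the calculation rather than the ImageJ procedure itself.

```python
from math import hypot


def polyline_length_um(points_px: list[tuple[float, float]], um_per_px: float) -> float:
    """Sum the segment lengths of a traced line and convert pixels to µm."""
    if len(points_px) < 2:
        raise ValueError("at least two points are needed to define a length")
    length_px = sum(
        hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points_px, points_px[1:])
    )
    return length_px * um_per_px


# Hypothetical trace of three points at an assumed scale of 2 µm per pixel.
example_length = polyline_length_um([(0, 0), (3, 4), (3, 10)], 2.0)
```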

C. elegans expresses three ATP13A2 orthologues: CATP-5, CATP-6 
and CATP-7. Therefore, in this study the following mutant alleles were 
used, each of which was backcrossed to N2 Bristol at least three times: 
catp-6(ok3473) IV, catp-7(tm4438) IV, catp-5(tm4481) X. Each of these 
is a null allele, so they are referred to in the text as catp-#(0). 

Transgenic strains carrying extrachromosomal arrays with the 
pRF4 rol-6(su1006) plasmid were generated by microinjection and 
maintained as described previously. Injection mixes typically 
contained 100 µg ml⁻¹ pRF4, plus 30 µg ml⁻¹ of the test construct. 
In some cases, Pmyo-2::gfp was included at approximately 5 µg ml⁻¹ 
as an additional method for detecting transgenic animals. To 
generate transport-defective versions of catp-6 and catp-7, plasmids 
carrying catp-6::mKate2 and catp-7::GFP were modified to change 
the coding sequence for the conserved DKTGT 
autophosphorylation motif to NKTGT. Proper expression and subcellular 
localization of CATP-6::mKate2 and CATP-7::GFP were verified in the case 
of all transgenic strains that were used (catp-6(0);[catp-6(+)] and 
catp-7(0);[catp-7(+)] express wild-type CATP-6 and wild-type CATP-7 in the 
null background, respectively; catp-6(0);[catp-6(D465N)] and 
catp-7(0);[catp-7(D503N)] express CATP-6(D465N) and CATP-7(D503N) 
in the null background). 


Statistics and reproducibility 

Data are expressed as the mean ± s.e.m. or with individual data points 
(replicates of multiple independent experiments) shown on group 
means or box and whisker plots (indication of median, 25th percentile, 
75th percentile and minimum to maximum value range). Flow cytometry 
was set up and gated as described in Supplementary Fig. 2. GraphPad 
Prism 7.04 was used to plot all graphs and to perform all of the required 
statistical and quantitative assessments. Statistical tests for each graph 
are described in the legend. The number of independent biological 
experiments for each panel is given in the figure legends. For 
the cell biological experiments using the SH-SY5Y control, ATP13A2 
KO, KO/WT and KO/D508N cell lines, each cell model is the sum of two 
independent clones, each tested a minimum of three independent 
times. For the quantification of immunoblots and radiograms, ImageJ 
and ImageQuant programmes were used. Experiments on various model 
systems were executed by different researchers, which provided 
consistent results that independently confirmed the major conclusions. 
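
For readers who want to reproduce the summary statistic named above outside GraphPad Prism, the following is a minimal sketch of the mean and standard error of the mean (s.e.m.). The helper name `mean_sem` and the example values are hypothetical; the sketch assumes that each value is one independent biological experiment.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, s.e.m.) for a list of independent measurements.

    The s.e.m. is the sample standard deviation (n - 1 denominator)
    divided by the square root of n.
    """
    n = len(values)
    if n < 2:
        raise ValueError("s.e.m. needs at least two measurements")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical readout from n = 3 independent experiments.
mean, sem = mean_sem([1.0, 2.0, 3.0])
```

With n = 3, the minimum used for most experiments here, the s.e.m. is itself a rough estimate, which is one reason individual data points are also shown.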


Reporting summary 


Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Gel source data for immunoblots and radiograms (Figs. 1, 2, Extended 
Data Figs. 1-3, 6) are available with the online version of the paper (Sup- 
plementary Fig. 1). All other datasets generated within this study are 
presented and analysed within this manuscript and are available from 
the corresponding author upon reasonable request. 
Extended Data Fig. 1 | ATP13A2 is a polyamine transporter. a-e, The ATPase 
activity of ATP13A2 was measured in solubilized microsomes (5 µg) collected 
from SH-SY5Y cells stably overexpressing wild-type ATP13A2 (WT-OE) in the 
presence of 100 nM CaCl₂, MnCl₂, ZnCl₂ or FeCl₃ and 100 µM SPD or SPM 
(D508N-OE as a negative control, wild type was referenced from Fig. 1b) (a) or in 
the presence of the indicated doses of inorganic ions and heavy metals CaCl₂, 
MnCl₂, ZnCl₂ or FeCl₃ (b), diamines (cadaverine, agmatine and the amino acid 
L-arginine) (c), monoamines (dopamine and histamine) (d) and acetylated 
polyamines (N¹-acetylspermine, N⁸-acetylspermidine or N¹-acetylspermidine) 
(e). As a reference for c-e, we plotted the dose-response curve of SPM from 
Fig. 1b. f, Microsomes (20 µg) collected from SH-SY5Y cells that overexpress 
ATP13A2 were incubated for 60 s with [γ-³²P]ATP in the presence of 10 mM 
ornithine (ORN) or SPM (referenced from Fig. 1c). Left, a representative 
autoradiogram of the phosphoenzymes (EP); right, quantification. CON, 
control. g, The ATPase activity of purified ATP13A2 was assessed after 1 mM 
SPM was administered in the presence or absence of 0.25 mM orthovanadate 
(ORTH), a general P-type ATPase inhibitor (supplemented with 125 µM 
phosphatidic acid/PtdIns(3,5)P₂; conditions (-) and (-)/SPM refer to Fig. 1f). 
h, Purified ATP13A2 was incubated with [γ-³²P]ATP in the presence of 1 mM SPM, 
and radioactivity of the phospho-intermediate was assessed by scintillation 
counting. i, Comparison of the pulse (5 µM, 15 min) chase (105 min, medium) 
BODIPY-SPM uptake in KO/WT and KO/D508N cell lines by confocal 
microscopy. Cells were subsequently stained with LAMP1 and imaged with the 
same laser settings by confocal microscopy. DAPI was used to visualize the 
nuclei. Scale bar, 5 µm. j, Line intensity plots of the indicated dashed lines in 
Fig. 2d. k, Analysis of the Pearson's coefficient of LAMP1 and BODIPY-SPM for 
the images in Fig. 2d (KO/WT, 78 images; KO/D508N, 85 images). l, Mean 
fluorescence intensities (MFI) of BODIPY in DAPI-positive regions of samples 
shown in Fig. 2d (KO/WT, 233 nuclei; KO/D508N, 243 nuclei). Data are presented 
as the mean ± s.e.m. or mean with individual data points shown (points 
represent replicates), with n = 3 independent biological experiments. Analysis 
was carried out using one-way ANOVA with Dunnett's (f) or Tukey's (a, g) 
corrections, or by two-tailed t-tests (unpaired, h or Welch's, k, l). Fitted lines are 
semi-log lines (b) or nonlinear allosteric sigmoidal association (c-e). For gel 
source data, see Supplementary Fig. 1. 
Extended Data Fig. 2 | Streptavidin-based purification of wild-type ATP13A2 
and the catalytically dead E343A mutant. a, Coomassie staining showing the 
purification process for wild-type ATP13A2, starting from solubilized yeast 
membrane fractions, followed by streptavidin affinity chromatography and 
on-column thrombin cleavage to elute the protein. b, Western blot analysis of 
stages in the purification of ATP13A2. c, Bar graph depicting protein purity as 
determined by densitometry from Coomassie-stained SDS-PAGE. d, Mass 
spectrometry analysis of the purified ATP13A2 sample. Singly, doubly and 
triply charged species are indicated. e, To evaluate phosphoenzyme formation, 
yeast P3 membranes (20 µg) and purified ATP13A2 (1 µg) were incubated for 
60 s with [γ-³²P]ATP. As a positive control, microsomes collected from SH-SY5Y 
cells that overexpress wild-type ATP13A2 (20 µg) were used. The image is a 
representative radiogram depicting the ATP13A2 phosphoenzyme. f, The 
ATPase activity of purified ATP13A2 (0.3 µg) was measured in the presence of 
2 mM SPM and the indicated concentrations of the ATP13A2 regulatory lipids 
phosphatidic acid (PA) and PtdIns(3,5)P₂. g, Coomassie staining showing the 
purification process for ATP13A2(E343A). h, The ATPase activity of purified 
wild-type ATP13A2 or ATP13A2(E343A) (0.5 µg) was measured in the presence 
of the indicated concentrations of SPM with 125 µM phosphatidic acid and 
125 µM PtdIns(3,5)P₂. Data are expressed as mean with individual data points 
(points represent replicates) (c, h). The numbers of independent biological 
experiments were as follows: n = 3 (b, e-h); n = 6 (d); n = 22 (a, c). Analysis was 
performed using one-way ANOVA with Tukey's post-hoc correction (h). For gel 
source data, see Supplementary Fig. 1. 
Extended Data Fig. 3 | Confirmation of CRISPR-Cas9-mediated ATP13A2 
knockout and subsequent rescue with wild-type ATP13A2 or the D508N 
mutant. a, The ATP13A2 knockout cell lines (KO) were generated by CRISPR-Cas9 
in SH-SY5Y cells and confirmed by qPCR (top) and immunoblotting 
(bottom). Atp13a2 mRNA expression was normalized to hypoxanthine 
phosphoribosyltransferase (HPRT) and TATA-sequence-binding protein (TBP), 
and GAPDH was used as a loading control for the ATP13A2 protein levels. Two 
fragments of the same blot are depicted and separated by a dotted line. 
b, Rescue of ATP13A2 knockout was performed by lentiviral transduction to 
express wild-type ATP13A2 (KO/WT) and the catalytically dead mutant D508N, 
which was used as a negative control (KO/D508N). The expression of the 
ATP13A2 constructs was confirmed via immunoblotting. The numbers of 
biologically independent experiments were as follows: n = 1 (a, top panel); n = 3 
(a, bottom panel, b). For gel source data, see Supplementary Fig. 1. 
Extended Data Fig. 4 | Polyamine uptake by ATP13A2 complements cytosolic 
polyamine synthesis. a-c, Assessment of the cellular uptake of BODIPY-labelled 
polyamine analogues (a, b) or FITC-dextran (c) by flow cytometry. 
Uptake of BODIPY-SPD (a) or BODIPY-SPM (b) in SH-SY5Y cells overexpressing 
Fluc (negative control), wild-type ATP13A2 (WT-OE) or the catalytically dead 
mutant D508N (D508N-OE). The cells were incubated with 5 µM BODIPY-SPM 
or BODIPY-SPD for 2 h before analysis by flow cytometry. c, Analysis of 
FITC-dextran uptake (as a measure of endocytic capacity) was performed in SH-SY5Y 
control (CON) cells with endogenous ATP13A2 expression, ATP13A2 knockout 
cells (KO) and rescue cell lines with expression of wild-type ATP13A2 (KO/WT) 
or the D508N mutant (KO/D508N). The cells were pre-treated for 30 min with a 
combination (combo) of endocytosis inhibitors Dynasore (100 µM), genistein 
(50 µM) and Pitstop 2 (50 µM). The cells were incubated for an additional 2 h 
with FITC-dextran at 37 °C, followed by flow cytometry. d, Schematic 
representation of polyamine synthesis. SRM, spermidine synthase; SMS, 
spermine synthase. Specific inhibitors are indicated in red. Control, KO, 
KO/WT or KO/D508N cells were subjected to inhibition of polyamine synthesis by 
DFMO (e), 4MCHA (f) and APCHA (g) before measuring cell viability via the 
MUH assay. All data represent the average of two independent CRISPR-Cas9 
knockout and control clones. All data are presented as mean with data points 
overlaid (points represent replicates) or mean ± s.e.m. The numbers of 
biologically independent experiments were as follows: n = 3 (a, b); n = 4 (c, e-g). 
Analysis was performed using one-way ANOVA with Dunnett's (a, b) or Tukey's 
(c) post-hoc correction or two-way ANOVA with Dunnett's post-hoc correction 
(e-g). Fitted lines indicate nonlinear log(inhibitor) versus response (variable 
slope) (e-g). 
Extended Data Fig. 5 | Catalytic and clinical mutations of ATP13A2 perturb 
polyamine function. a, Sequence alignment of predicted transmembrane 
helices M4 (left), M6 (middle) and M8 (right). The alignment was generated 
using Clustal Omega. We generated mutants in M4 (A467V), M6 (D962N) and 
M8 (K1062A). The A467V mutation converts the protein sequence PPALP of the 
predicted substrate-binding site in transmembrane segment M4 into the 
protein sequence PPVLP that is present in ATP13A5. Neighbouring 
membrane helices also contribute to substrate coordination in P-type ATPases, 
which often relies on conserved and charged residues, such as D962 in M6 and 
K1062 in M8 of ATP13A2. b, Densitometry of the expression of catalytic 
mutants presented in Fig. 2e. c, Flow cytometric analysis of cellular 
BODIPY-SPD uptake in SH-SY5Y cells overexpressing wild-type ATP13A2, the D508N 
mutant or catalytic mutants. d, Quantification of ATP13A2 phosphorylation 
levels (EP) presented in Fig. 2g. e, Densitometry analysis of the expression of 
disease-related mutants presented in Fig. 2i. f, Flow cytometric analysis of 
cellular BODIPY-SPD uptake in SH-SY5Y cells overexpressing wild-type 
ATP13A2, D508N or disease mutants. g, Quantification of ATP13A2 
phosphorylation levels presented in Fig. 2k. All data are depicted as mean with 
individual data points (points represent replicates). The numbers of 
independent biological experiments were as follows: n = 3 b, c, d (D508N (SPM), 
A467V (SPM), D962N (SPM) and K1062A (SPM)), e (T12M, G872R), f, g (T12M (-), 
T12M (SPM), T512I (SPM), G528R (SPM), A741T (SPM) and G872R (SPM)); n = 4 d 
(D508N (-), A467V (-), D962N (-) and K1062A (-)), e (wild-type, T512I, 
G528R and A741T), g (wild-type (-), T512I (-), G528R (-), A741T (-) and G872R 
(-)); n = 5 d (wild-type (-)); n = 6 d (wild-type (SPM)) and g (wild-type (SPM)). 
Analysis by one-way ANOVA with Dunnett's (b, c, e, f) or two-way ANOVA with 
Sidak's (d, g) post-hoc corrections. 
Extended Data Fig. 6 | The ATP sensitivity of ATP13A2(D962N) and 
ATP13A2(E343A) is independent of SPM. a, Overview of rate constants of 
ATP13A2 phosphoenzyme decay following a chase with non-radioactive ATP 
with or without 1 mM SPM. b, After 30 s of incubating D962N microsomes 
(20 µg) with [γ-³²P]ATP, the time course of dephosphorylation after an ATP chase 
was measured in the presence or absence of SPM. The top panel shows a 
representative autoradiogram of the phosphoenzymes (EP), whereas the 
bottom panel depicts the quantification of ATP13A2 phosphorylation levels. As 
a reference we plotted the wild-type curve, shown in Fig. 1d. Data are presented 
as the mean ± s.e.m. of n = 4 biologically independent experiments. Analysis by 
two-way ANOVA with Tukey's test (b). The fitted line indicates two-phase decay 
(b). For gel source data, see Supplementary Fig. 1. 
Extended Data Fig. 7 | Predicted topology of ATP13A2. a, Homology model of 
ATP13A2 based on the structure of Na⁺/K⁺-ATPase (ATP1A1, PDB ID: 3A3Y) as a 
template, generated by I-TASSER 
(https://zhanglab.ccmb.med.umich.edu/I-TASSER/). b, Predicted membrane 
topology of ATP13A2 visualized by Protter (http://wlab.ethz.ch/protter). 
ATP13A2 consists of 10 transmembrane helices (M1-10) and an N-terminal 
membrane-associated region (Ma). Kufor-Rakeb [...] 
spastic paraplegia (HSP)-associated mutations in light blue; neuronal ceroid 
lipofuscinosis (NCL)-associated mutations in orange. Catalytic mutations and 
mutations in the predicted substrate-binding region are highlighted in dark 
blue. Residues that were subjected to mutagenesis in this study are 
labelled (only in a). P-type ATPase signature motifs in the cytosolic domains are 
indicated in pink (only in b). 
Extended Data Fig. 8 | Lysosomal functionality and recovery. a-d, The 
impact of exogenous polyamines on cell toxicity (24 h) and lysosomal 
functionality (4 h) was assessed in SH-SY5Y control cells (CON) with 
endogenous ATP13A2, ATP13A2 knockout cells (KO) and rescue cell lines with 
wild-type expression (KO/WT) or expression of the catalytically dead mutant 
D508N (KO/D508N) on the KO background. Cytotoxicity of SPD (a), SPM (b), 
ORN (c) and PUT (d) was assessed via a MUH cell-viability assay. e, Death of the 
aforementioned cells was assessed after 4 h of SPM exposure (10 µM) by 
propidium iodide (PI)-based flow cytometry. f, Measurement of cathepsin D 
activity. g, h, Lysosomal rupture under basal (-) and SPM (10 µM) conditions 
was assessed via loss of FITC-dextran (FITC-DEX) punctae (g) or loss of 
cathepsin B (CTSB)/LAMP1 colocalization (h). Confocal images depict 
representative images with or without SPM exposure (4 h, DAPI staining for 
nuclei was included as a reference). Scale bars, 10 µm. The box and whisker 
plots in g depict the frequency (left) and size (right) of FITC-DEX punctae; in h 
the Pearson coefficient of colocalization of cathepsin B and LAMP1. 
i, Lysosomal pH (Fig. 3b) was evaluated using the fluorescent probe 
FITC-dextran and a dual-emission ratiometric technique. FITC is excited at 488 nm 
and emission is analysed at 530 nm (BL1) and 610 nm (BL2). A pH calibration 
curve was generated using FITC-dextran in cells permeabilized with 100 µM 
monensin and equilibrated with calibration buffers (pH 3-8). j, Representative 
size distribution of the acidic nanoparticles used in this study. Data are 
presented as the mean ± s.e.m. (a-d, i) or individual data points (representing 
replicates) overlaid on group means (e) or box and whisker plots (f-h, 
line, median; box boundaries, 25th and 75th percentiles). The numbers of 
independent biological experiments were as follows: n = 3 (e-j); n = 4 (a-d). 
Analysis was performed using two-way ANOVA with Dunnett's (a, b) or 
Bonferroni's (c, d) post-hoc corrections, or one-way ANOVA with Dunnett's (e), 
Sidak's (f, g (right)) or Tukey's (g (left), h) post-hoc corrections. Fitted lines 
indicate nonlinear log(inhibitor) versus response (variable slope) (a, b). 
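
The dual-emission ratiometric readout in panel i can be illustrated with a minimal sketch that converts a BL1/BL2 emission ratio into a pH value by linear interpolation between calibration points. The legend does not state how the calibration curve was fitted, so the interpolation is an assumption, and the function name `ph_from_ratio` and all calibration values are hypothetical.

```python
from bisect import bisect_left


def ph_from_ratio(ratio, calibration):
    """Estimate pH from an emission ratio (530 nm / 610 nm).

    calibration: list of (ratio, pH) pairs measured in permeabilized
    cells equilibrated with buffers of known pH (hypothetical numbers).
    Values between calibration points are linearly interpolated.
    """
    pts = sorted(calibration)
    ratios = [r for r, _ in pts]
    if not ratios[0] <= ratio <= ratios[-1]:
        raise ValueError("ratio outside calibrated range")
    i = bisect_left(ratios, ratio)
    if ratios[i] == ratio:
        return pts[i][1]
    (r0, p0), (r1, p1) = pts[i - 1], pts[i]
    return p0 + (p1 - p0) * (ratio - r0) / (r1 - r0)


# Hypothetical calibration points spanning the pH 3-8 buffers.
calibration = [(1.0, 3.0), (2.0, 4.0), (4.0, 6.0), (5.0, 8.0)]
estimated_ph = ph_from_ratio(3.0, calibration)
```

Refusing to extrapolate outside the calibrated range is a deliberate choice in this sketch, because a ratio beyond the pH 3-8 buffers has no measured reference.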
Extended Data Fig. 9 | Inhibition of cathepsin B activity attenuates 
SPM-induced neuronal death. The effect of a cathepsin B inhibitor (CA-074, 10 µM) 
on SPM-induced (10 µM, 24 h) cell death in control (miR-Fluc) and Atp13a2 
knockdown (miR-3 and miR-5) neurons was assayed via TUNEL-based staining. 
Left, representative confocal images depicting TUNEL-positive neurons; right, 
box and whisker plots with the quantification of the TUNEL staining. Data are 
presented as box and whisker plots (line, median; box boundaries, 25th and 
75th percentiles) for which individual data points (representing replicates) are 
shown. n = 3 biologically independent experiments. Analysis by one-way 
ANOVA with Tukey's post-hoc correction. 


Extended Data Table 1 | Apparent Km and Vmax values for ATP13A2 in the presence of various polyamines 


                     Apparent Km                             Vmax 
                     Microsomal ATP13A2   Purified ATP13A2   Microsomal ATP13A2   Purified ATP13A2 
Spermine             149 ± 34             76 ± 26            140 ± 6              159 ± 10 
N¹-acetylspermine    286 ± 48             not tested         98 ± 5               not tested 
Spermidine           ~1700*               890 ± 501          ~106*                194 ± 26 


Data are derived from Fig. 1b, f and Extended Data Fig. 1e. Data represent n = 3 (spermidine, N¹-acetylspermine), or n = 6 (spermine) biologically independent experiments. *Estimated values 
(could not be accurately determined following fitting of the curve). 
Statistics 


For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. 


n/a | Confirmed 


The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 




A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 


A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) 
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 




For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings 


For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 


Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection Typhoon TM FLA 9500 (phosphorimaging) 
Flexstation 3 microplate reader (fluorescent and luminescent assays) 
Bio-Rad ChemiDoc (western blot imager) 
QuantaSmart TM (v2.03) (scintillation counting) 
Attune Cytometric Software (v2.1) (flow cytometry) 
ZEN SP5 2012 (confocal imaging) 


Data analysis GraphPad Prism Version 7.04 was used for data representation and statistical analysis. ImageJ and Zen 2.3 Lite were used for image 
analysis. ImageQuant TL (version 8.1) was used for analysis of gel-based auto-phosphorylation assays. Image Lab (5.2.1) was used to 
analyze western blots. XCalibur Quan tool (version 4.2.28.14) was used for polyamine quantification (LC-MS). Flowing Software (v2.5.1) 
was used for flow cytometry analysis. I-TASSER (protein structure and function predictions, https://zhanglab.ccmb.med.umich.edu/I-TASSER/). 
PyMOL (structure visualization, version 2.3) 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 


All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 
- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


Source data for immunoblots and radiograms (Figs. 1-2, Extended Data Figs. 1-3, 6) are available with the online version of the paper. All other datasets generated 
within this study are presented and analysed within this manuscript and are available from the corresponding author upon reasonable request. 


Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


x Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences 


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size No statistical methods were used to pre-determine sample size. All fundamental experiments were performed a minimum of three times 
(independent biological experiments), including all relevant controls, to allow for the generation of S.E.M. Confirmation of ATP13A2 knockout 
(SH-SY5Y) was performed only once by qRT-PCR and N = 3 by western blot analysis. Knockdown of ATP13A2 in neurons was only confirmed 
two times by qRT-PCR. 
Data exclusions No data were excluded. 


Replication All experimental findings were reproduced in several independent biological experiments (N) with multiple technical replicates. For each 
figure panel, the number of independent experiments N is indicated in the figure legends. Conclusions were independently confirmed in 
different model systems (in vitro, in cell lines, isolated neurons and in vivo) handled by multiple researchers and across several laboratories. 


Randomization Samples were not randomized, but appropriate controls were included in each figure. For in vivo studies (nematodes), animals were evenly 
distributed such that each group had a similar mean/density at the start of the study. 


Blinding Analyses were not blinded because experiments were performed and analyzed by the same researchers. Biochemistry, cell biology and in vivo 
research were performed independently by different researchers and their findings support one another providing independent confirmation. 


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems Methods 
n/a | Involved in the study n/a | Involved in the study 
Antibodies ChIP-seq 
Eukaryotic cell lines Flow cytometry 
Palaeontology MRI-based neuroimaging 


Animals and other organisms 


Human research participants 


Clinical data 


Antibodies 


Antibodies used Primary antibodies used in this study with supplier, catalog number and lot number, as well as the used dilution. 


SIGMA 
anti-ATP13A2 antibody; catalogue number A3361, lot # 076M4839, (Western Blotting, Rabbit, 1:1000) 
anti-GAPDH antibody; catalogue number G8795, lot # 0O76M4785V, (Western Blotting, Mouse, 1:5000) 


ABCAM 

anti-galectin-3 antibody; catalogue number ab2785, lot # GR3200865-11, (Immunofluorescence, Mouse, 1:200) 
anti-cathepsin B antibody; catalogue number ab58802, lot # GR3181508-13, (Immunofluorescence, Mouse, 1:100) 
anti-lamp1 antibody; catalogue number ab24170, lot # GR3235359-1, (Immunofluorescence, Rabbit, 1:200) 


Validation Concerning antibody specificity, we kindly refer to the supplier's websites and datasheets to find statements on specificity and 
citations for the use of the antibodies: 


anti-ATP13A2 antibody; https://www.sigmaaldrich.com/catalog/product/sigma/a3361?lang=en&region=BE 
anti-GAPDH antibody; https://www.sigmaaldrich.com/catalog/product/sigma/g8795?lang=en&region=BE 
anti-galectin-3 antibody; https://www.abcam.com/galectin-3-antibody-a3a12-ab2785.html 


anti-cathepsin B antibody; https://www.abcam.com/cathepsin-b-antibody-ca10-ab58802.html 
anti-lamp1 antibody; https://www.abcam.com/lamp1-antibody-lysosome-marker-ab24170.html 


Eukaryotic cell lines 


Policy information about cell lines 


Cell line source(s) SH-SY5Y from ATCC (ATCC number CRL-2266™) (lot number 62431864) 
SH-SY5Y in house collection 
HEK-293T from ATCC (https://www.lgcstandards-atcc.org/Products/All/CRL-11268.aspx?geo_country=be) 


Authentication SH-SY5Y (ATCC); were certified by ATCC by STR genotype analysis 
(https://www.lgcstandards-atcc.org/en/Products/All/CRL-2266.aspx#documentation). 
SH-SY5Y cells; (from the in house collection) were authenticated via DNA fingerprinting (Leibniz-Institut DSMZ-Deutsche 
Sammlung von Mikroorganismen und Zellkulturen GmbH). 
HEK-293T (ATCC); were certified by ATCC by STR genotype analysis 
(https://www.lgcstandards-atcc.org/Products/All/CRL-11268.aspx?geo_country=be#documentation) 
Mycoplasma contamination All cell lines were routinely tested for mycoplasma contamination using the MycoAlert Mycoplasma Detection kit (LT07-418) 
and no contamination was detected. 


Commonly misidentified lines No commonly misidentified cell lines were used. 
(See ICLAC register) 


Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals Mice - E16 FVB/N mice (females for embryos, neuronal isolation). C. elegans - N2 Bristols (Males and females, larvae to adult 
growth analysis) 


Wild animals No wild animals were used within this research 
Field-collected samples No field collected samples were used within this research 
Ethics oversight All mouse experiments were carried out in accordance with the European Communities Council Directive of November 24, 1986 
(86/609/EEC) and approved by the Bioethical Committee of the KU Leuven (Belgium) (ECD project P185-2014). 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


Flow Cytometry 


Plots 


Confirm that: 


The axis labels state the marker and fluorochrome used (e.g. CD4-FITC). 


The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers). 


All plots are contour plots with outliers or pseudocolor plots. 


A numerical value for number of cells or percentage (with statistics) is provided. 


Methodology 
Sample preparation Samples were harvested by TRYPLE, washed and stained. Prior to FACS, cells were resuspended in PBS containing BSA, filtered 
and stored on ice. Analysis was performed on living (non-fixed) cells. 
Instrument Attune NxT - Thermo Fisher Scientific 
Software Attune Cytometric Software (v2.1) 


Cell population abundance — Flow cytometry was performed on cell lines and at no point tracked a subpopulation of a heterogeneous cell mixture. We 
assessed 10,000 events of the total cell population, including both living and dead cells. 


Gating strategy Total cells were gated (R1) to remove any signal contamination from debris. Upon selection live/dead cells were characterized in 
(RX/RY) or fluorescence acquired (RZ). In all cases samples were gated on the position of the negative population (unstained) of 
each cell line. Detection was obtained within the relevant detection window. 


Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information. 
Bacteriophages typically have small genomes and depend on their bacterial hosts for 
replication. Here we sequenced DNA from diverse ecosystems and found hundreds of 
phage genomes with lengths of more than 200 kilobases (kb), including a genome of 
735 kb, which is, to our knowledge, the largest phage genome to be described to 
date. Thirty-five genomes were manually curated to completion (circular and no 
gaps). Expanded genetic repertoires include diverse and previously undescribed 
CRISPR-Cas systems, transfer RNAs (tRNAs), tRNA synthetases, tRNA-modification 
enzymes, translation-initiation and elongation factors, and ribosomal proteins. The 
CRISPR-Cas systems of phages have the capacity to silence host transcription factors 
and translational genes, potentially as part of a larger interaction network that 
intercepts translation to redirect biosynthesis to phage-encoded functions. In 
addition, some phages may repurpose bacterial CRISPR-Cas systems to eliminate 
competing phages. We phylogenetically define the major clades of huge phages from 
human and other animal microbiomes, as well as from oceans, lakes, sediments, soils 
and the built environment. We conclude that the large gene inventories of huge 
phages reflect a conserved biological strategy, and that the phages are distributed 
across a broad bacterial host range and across Earth's ecosystems. 


Phages—viruses that infect bacteria—are considered distinct from 
cellular life owing to their inability to carry out most biological pro- 
cesses required for reproduction. They are agents of ecosystem change 
because they prey on specific bacterial populations, mediate lateral 
gene transfer, alter host metabolism and redistribute bacterially 
derived compounds through cell lysis” *. They spread antibiotic resist- 
ance’ and disperse pathogenicity factors that cause disease in humans 
and animals®’. Most knowledge about phages is based on laboratory- 
studied examples, the vast majority of which have genomes that area 
few tens of kb in length. Widely used isolation-based methods select 
against large phage particles, and they can be excluded from phage 
concentrates obtained by passage through 100-nm or 200-nm filters. 
In 2017, only 93 isolated phages with genomes that were more than 


200 kb in length were published’. Sequencing of whole-community 
DNA can uncover phage-derived fragments; however, large genomes 
can still escape detection owing to fragmentation®. A new clade of 
human- and animal-associated megaphages was recently described 
on the basis of genomes that were manually curated to completion 
from metagenomic datasets’. This finding prompted us to carry out a
more-comprehensive analysis of microbial communities to evaluate
the prevalence, diversity and ecosystem distribution of phages with
large genomes. Previously, phages with genomes of more than 200 kb
have been referred to as ‘jumbophages’ or, in the case of phages with
genomes of more than 500 kb, as megaphages’. Because the set reconstructed
here spans both size ranges, we refer to them simply as ‘huge phages’.
A graphical abstract provides an overview of our approach and main
findings (Extended Data Fig. 1). This study expands our understanding
of phage biodiversity and reveals the wide variety of ecosystems in
which phages have genomes with sizes that rival those of small-celled
bacteria’ ”. We postulate that these phages have evolved a distinct
‘life’ strategy that involves extensive interception and augmentation
of host biology while they replicate their huge genomes.

Fig. 1 | Distribution of the genome sizes and tRNAs of phages. a, Size
distribution of circularized bacteriophage genomes from this study, Lak
megaphage genomes reported recently for a subset of the same samples’ and
reference sources. Reference genomes were collected from all complete
RefSeq r92 dsDNA genomes and non-artefactual assemblies with lengths of
more than 200 kb from a previous study“. b, Histogram of the genome size
distribution of phages with genomes of more than 200 kb from this study, Lak
and reference genomes. c, Box-and-whisker plot of tRNA counts per genome
from this study and Lak phages as a function of genome size (Spearman’s
ρ = 0.61, P=4.5 x10, n = 201 individual phage genomes). The middle line for
each box marks the median tRNA count for each size bin, the box marks the
interquartile range, and the whiskers represent the maximum and minimum.


Ecosystem sampling

Metagenomic datasets were acquired from human faecal and oral samples,
faecal samples from other animals, freshwater lakes and rivers,
marine ecosystems, sediments, hot springs, soils, deep subsurface
habitats and the built environment (Extended Data Fig. 2). Genome
sequences that were clearly not bacterial, archaeal, archaeal virus,
eukaryotic or eukaryotic virus were classified as phage, plasmid-like
or mobile genetic elements of uncertain nature on the basis of their
gene inventories (Supplementary Information). De novo assembled
fragments close to or more than 200 kb in length were tested for
circularization and a subset was selected for manual verification and
curation to completion (Methods).


Genome sizes and basic features


We reconstructed 351 phage sequences, 6 plasmid-like sequences
and 4 sequences of unknown classification (Extended Data Fig. 2). We
excluded additional sequences that were inferred to be plasmids
(Methods), retaining only those that encoded CRISPR-Cas loci. We included
3 phage sequences of <200 kb in length owing to the presence of
CRISPR-Cas loci. Consistent with the classification as phages, we identified a
wide variety of phage-relevant genes, including those involved in lysis
and encoding structural proteins, and documented other expected
genomic features of phages (Supplementary Information). Some predicted
proteins were large, up to 7,694 amino acids in length; some were
tentatively annotated as structural proteins. In total, 175 phage sequences 
were circularized and 35 were manually curated to completion, in some
cases by resolving complex repeat regions, revealing their encoded 
proteins (Methods and Supplementary Table 1). The remaining genomes 
are probably incomplete, although some may be complete, but linear. 
Approximately 30% of genomes show clear GC skew indicative of bidi- 
rectional replication and 30% have patterns indicative of unidirectional 
replication” (Extended Data Fig. 3 and Supplementary Information). 
Our 4 largest complete, manually curated and circularized phage
genomes are 634, 636, 642 and 735 kb in length and are, to our
knowledge, the largest phage genomes reported to date. The largest
previously reported circularized phage genome was 596 kb in length”. The
same previous study also reported a circularized genome of 630 kb in
length; however, this is an assembly artefact (Supplementary
Information). The problem of concatenation artefacts was sufficiently
prominent in IMG/VR* that we did not include these data in further 
analyses. We used both complete and circularized genomes from our 
study and published phage genomes to produce an updated view of 
the distribution of phage genome sizes (Methods). Without the huge 
phages reported here, the median genome size for complete phages is 
around 52 kb (Fig. 1a). Thus, the sequences reported here substantially 
expand the inventory of phages with unusually large genomes (Fig. 1b). 
Some of our reported genomes have a very low coding density 
(9 genomes have densities of less than 78%) (Supplementary Informa- 
tion), probably owing to the use of a genetic code that is different from 
the standard code (Methods). This phenomenon has been rarely noted 
in phages, but has previously been reported for the Lak phage’ and in a
previous study”. In the current study, some genomes (mostly those that
are associated with humans and/or animals) appear to have reassigned
the UAG (amber) stop codon to encode an amino acid (Extended Data
Fig. 4 and Supplementary Information).

In only one case, we identified a sequence of more than 200 kb that
was classified as a prophage on the basis of the transition into a flanking
bacterial genome sequence. However, around half of the genomes were
not circularized, so their potential integration as a prophage cannot be
ruled out. The presence of integrases in some genomes is suggestive
of a temperate lifestyle under some conditions.

Fig. 2 | Phylogenetic reconstruction of the evolutionary history of huge
phages. The phylogeny of phages was reconstructed using large terminase
sequences from this study (n = 397) and similar matches from all RefSeq r92
proteins (n = 532). The tree also includes large terminase sequences from
complete RefSeq phage, the Lak megaphage clade? (n = 9) and non-artefactual
phage genomes that are more than 200 kb, from a previous study”. Huge phage
clades identified in this study were independently corroborated with a
phylogenetic reconstruction of major capsid protein (MCP) genes (Extended
Data Fig. 5a) and protein clustering (Extended Data Fig. 5b). The tree was
rooted using eukaryotic herpesvirus terminases (n = 7). The inner to outer rings
display the presence of CRISPR-Cas in this study, host phylum, environmental
sampling type and genome size. Host phylum and genome size were not
included for RefSeq protein database matches for which the sequence may be
from an integrated prophage or part of organismal genome projects. Scale bars
show the number of substitutions per site (left) and number of base pairs
(right).


Hosts, diversity and distribution


An intriguing question relates to the evolutionary history of phages with
huge genomes; namely, whether they are the result of recent genome
expansion within clades of normal-sized phages or whether a large
inventory of genes is an established, persistent strategy. To investigate 
this, we constructed phylogenetic trees for large terminase subunit 
proteins (Fig. 2) and major capsid proteins (Extended Data Fig. 5a) 
using sequences from public databases as a context (Methods). Many 
of the sequences from our phage genomes cluster together with high 
bootstrap support, thus defining clades. Analysis of the genome size 
information for database sequences shows that the public sequences 
that fall into these clades are from phages with genomes of at least 
120 kb in length. The largest clade, referred to here as Mahaphage (Maha
being Sanskrit for huge), includes all of our largest genomes as well as 
the 540-552 kb Lak genomes from human and animal microbiomes’. 
We identified nine other clusters of large phages, and refer to them 


Nature | Vol578 | 20 February 2020 | 427 


Article 


using the words for ‘huge’ in the languages of some authors of this
paper. We acknowledge that the detailed tree topologies for different
genes and datasets vary slightly; however, the clustering is broadly
supported by protein family and capsid analyses (Extended Data Fig. 5a, b).
The fact that large phages are consistently grouped together into
clades establishes that a large genome size is a relatively stable
trait. Within each clade, phages were sampled from a wide variety of
environment types (Fig. 2), indicating the diversification of these huge
phages and their hosts across ecosystems. We also examined the
environmental distribution of phages that are so closely related that their
genomes can be aligned and we found 20 cases in which the phages
occur in at least 2 distinct cohorts or habitat types (Supplementary
Table 2).

To determine the extent to which bacterial host phylogeny
correlates with phage clades, we identified some phage hosts using CRISPR
spacer targeting from bacteria in the same or related samples and
phylogenies of normally host-associated phage genes (see below,
Supplementary Table 3). We also tested the predictive value of
bacterial taxonomic affiliations of the phage gene inventories (Methods)
and found that in every case, CRISPR spacer targeting and phylogeny
agreed with phylum-level taxonomic profiles. We therefore used
taxonomic profiles to predict the bacterial host phylum for many
phages (Supplementary Table 4). The results establish the
importance of Firmicutes and Proteobacteria as hosts (Extended Data Fig. 2)
(P=2.5x10°, n=74, W= 606; one-sided Wilcoxon signed-rank test).
The higher prevalence of Firmicutes-infecting huge phages in the
human and animal gut compared with other environments reflects the
potential host compositions of the microbiomes (P= 9.3 x 10”, n=37,
U=238; one-sided Mann-Whitney U-test). Notably, the 5 genomes
that were more than 634 kb in length were all from phages that were
predicted to replicate in Bacteroidetes, as do Lak phages’, and all
cluster within the Mahaphage clade. Overall, phages that grouped
together phylogenetically are predicted to replicate in bacteria of
the same phylum (Fig. 2).

[Figure 3 schematic. Panels: Transcription (initiation, elongation) and
Translation (initiation, elongation, termination). Legend: phage tRNA;
sigma factor.]

Fig. 3 | A model for phage interception and redirection of host translational
systems. Potential mechanisms for how phage-encoded capacities could
function to redirect the translational system of the host to produce phage
proteins (bacterial components in blue, phage proteins in red). No huge phage
encodes all translation-related genes, but many have tRNAs and tRNA
synthetases (Supplementary Table 6). Phage proteins with up to six ribosomal
protein S1 domains occur in a few genomes. The S1 binds to mRNA to bring it
into the site on the ribosome where it is decoded”’. Phage ribosomal protein S21
might promote translation initiation of phage mRNAs, and many sequences
have N-terminal extensions that may be involved in binding RNA (dashed blue
line in ribosome insert (RCSB Protein Data Bank (PDB) code: 6BU8*°)), analysed
with UCSF Chimera®. Many other proteins of the translational apparatus that
belong to all steps of the translation cycle are encoded by huge phages. aaRS,
aminoacyl-tRNA synthetase; CCA-adding, tRNA nucleotidyltransferase; EF,
elongation factor; IF, initiation factor; PDF, peptide deformylase; QueC/D/F,
queuosine synthesis and tRNA modification; RF, release factor; RNA Pol, RNA
polymerase; RRF, ribosome recycling factor; TFIIB, transcription factor IIB.


Metabolism, transcription and translation 


The phage genomes encode proteins that are predicted to localize 
to the bacterial membrane or cell surface. These may affect the sus- 
ceptibility of the host to infection by other phages (Supplementary 
Table 5 and Supplementary Information). We identified almost all of 
the previously reported categories of genes that have been suggested 
to augment host metabolism (Supplementary Information). Many 
phages have genes involved in the de novo biosynthesis of purines and 
pyrimidines, and the interconversion of nucleic and ribonucleic acids 
and nucleotide phosphorylation states. These gene sets are intriguingly 
similar to those of bacteria with very small cells and putative symbiotic 
lifestyles’ (Supplementary Table 5). 

Notably, many phages have genes with predicted functions in tran- 
scription and translation (Supplementary Table 6). Complete phage 
genomes encode up to 67 tRNAs, with sequences that are distinct from 
those of their hosts (Supplementary Table 7). Generally, the number 
of tRNAs per genome increases with genome length (Fig. 1)
(Spearman’s ρ = 0.61, P=4.5 x10 ”, n = 201). Huge phages have up to 15 tRNA
synthetases per genome (Supplementary Table 7), which are also
distinct from but related to those of their hosts (Extended Data Fig. 7a
and Supplementary Information). Phages may use these proteins to
charge their own tRNA variants with host-derived amino acids. A subset
of genomes has genes for tRNA modification and ligation of tRNAs 
cleaved by host defenses. 
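
The rank correlation quoted above can be recomputed for any table of genome lengths and tRNA counts. The following is a minimal sketch of Spearman's ρ in plain Python, using average ranks for ties. The function names are illustrative assumptions, and the values in the usage note are invented toy numbers, not data from this study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

With the invented inputs `[200, 250, 300, 400, 500]` (genome sizes in kb) and `[5, 9, 7, 20, 31]` (tRNA counts), `spearman_rho` gives approximately 0.9, because only one adjacent pair of ranks is swapped.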

Many phages carry genes that are implicated in the interception and 
redirection of host translation. These genes include the initiation factors 
IF1 and IF3, as well as ribosomal proteins S4, S1, S21 and L7/L12 (ribosomal
proteins were only recently reported in phages” (Fig. 3)). Both rpS1 and
rpS21 are important for translation initiation in bacteria’® °, making them
likely to be useful for the hijacking of host ribosomes. Further analysis of
rpS21 proteins revealed N-terminal extensions that were rich in basic and
aromatic residues important for RNA binding. We predict that these phage
ribosomal proteins substitute for host proteins”, and their extensions assist
in competitive ribosome binding or preferential initiation of phage mRNAs.

Because rpS1 is often studied in the context of Shine-Dalgarno
sequence recognition by the ribosome””®, we predicted the ribosomal
binding sites for each phage genome (Methods). Whereas most phages
have canonical Shine-Dalgarno sequences, huge phages from this
study that carry possible rpS1s rarely have identifiable Shine-Dalgarno
sequences (Supplementary Information and Supplementary Table 8).
It is difficult to confirm ‘true’ rpS1 proteins owing to the ubiquity of the
S1 domain, but this correlation with non-canonical Shine-Dalgarno
sequences suggests a role in translation initiation, either on or off the
ribosome.

Fig. 4 | Phage and bacterial CRISPR-interaction dynamics. a, Cell diagram of
bacterium-phage and phage-phage interactions that involve CRISPR targeting
during superinfection. Arrows indicate CRISPR-Cas targeting of the prophage
and phage genomes. Phage names indicate related groups delineated by
whole-genome alignment. We only included CRISPR interactions from samples
of subjects of the same human cohort. b, Maximum likelihood phylogenetic
tree of Cas12 subtypes a-i. Phage-encoded Cas12i and CasΦ, the new effector,
are outlined in red, with bacteria-encoded proteins in blue. Bootstrap values
>90 are shown on the branches (circles). Cas14 and type V-U trees are provided
separately (Supplementary Fig. 11). Scale bars indicate the number of
substitutions per site. c, Top, alignment of the consensus repeats from the A9
phage array and predicted host bacterial arrays. Bottom, interaction network
showing the targeting of bacteria-encoded (blue) and phage-encoded (red)
CRISPR spacers. The number of edges indicates the number of spacers from the
array with targets to the smaller node. Solid edges denote spacer targets with
no or one mismatch, and dashed edges denote two to three mismatches (to
account for degeneration in old-end phage spacers, diversity in different
subjects or phage mutation to avoid targeting).


Although assuming control of initiation may be the most logical 
step for the redirection of host translation by the phage, improving 
the efficiency of elongation and termination is necessary for robust 
infection and replication. Accordingly, we found many genes associ- 
ated with the later steps of translation in phage genomes. These include 
elongation factors G, Tu and Ts, rpL7/12 and the processing enzyme 
peptide deformylase (Fig. 3), which has previously been reported in 
phage genomes”. We hypothesize that phage-encoded elongation 
factors maintain the overall translation efficiency during infection, 
much like the previously predicted role of peptide deformylase in
sustaining translation of the necessary photosynthetic proteins of the
host”. Translation termination factors are also represented in our huge 
phage genomes, including release factor 1 and 2, ribosome recycling 
factor, as well as transfer messenger RNAs (tmRNAs) and small protein 
B (SmpB), which rescue ribosomes stalled on damaged transcripts 
and trigger the degradation of aberrant proteins. These tmRNAs are 
also used by phages to sense the physiological state of host cells and 
can induce lysis when the number of stalled ribosomes in the host is 
high”. Notably, some large putative plasmids have analogous suites 
of translation-relevant genes (Supplementary Table 5).


CRISPR-Cas-mediated interactions 


We identified most major types of CRISPR-Cas systems in phages, 
including Cas9-based type II, the recently described type V-I?, new 
variants of the type V-U systems” and new subtypes of the type V-F 
system” (Extended Data Fig. 8). The class II systems (types II and V)
have not previously been reported in phages. Most phage effector 
nucleases (for interference) have conserved catalytic residues, imply- 
ing that they are functional. 

In contrast to the well-described case of a phage with a CRISPR system”,
almost all phage CRISPR systems lack spacer acquisition machinery
(Cas1, Cas2 and Cas4) and many lack recognizable genes for interference
(Extended Data Fig. 9 and Supplementary Table 1). For example, two
related phages have a type I-C variant system that lacks Cas1 and Cas2
and have a helicase protein instead of Cas3. These phages also have a
second system that contains a new candidate type V effector protein,
CasΦ (Cas12j), which is approximately 750 amino acids in length and
occurs proximal to CRISPR arrays (Fig. 4 and Supplementary Table 1).

In some cases, phages that lack genes for interference and spacer 
integration have similar CRISPR repeats as their hosts (Fig. 4c) and 
may therefore use the Cas proteins of the host. Alternatively, systems 
that lack an effector nuclease may repress the transcription of the 
target sequences without cleavage’””*. Additionally, spacer-repeat 
guide RNAs may have an RNA-interference-like mechanism to silence 
host CRISPR systems or nucleic acids to which they can hybridize. The 
phage-encoded CRISPR arrays are often compact (median, six repeats 
per array) (Extended Data Fig. 10). This range is substantially smaller 
than typically found in prokaryotic genomes (mean of 41 repeats for 
class I systems)*’. Some phage spacers target core structural and regu- 
latory genes of other phages (Fig. 4c and Supplementary Table 10). 
Thus, phages apparently augment the immune arsenal of their hosts 
to prevent infection by competing phages. 
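
The caption of Fig. 4c distinguishes spacer targets with no or one mismatch (solid edges) from those with two to three mismatches (dashed edges). The sketch below illustrates that kind of mismatch-tolerant spacer search. It is an assumption-laden toy, not the pipeline used in the study: the function names are hypothetical, only the given strand is scanned (a real search would also test the reverse complement), and gaps are not allowed.

```python
def spacer_hits(spacer, target, max_mismatches=3):
    """Return (position, mismatches) for every ungapped window of `target`
    that matches `spacer` with at most `max_mismatches` substitutions."""
    spacer, target = spacer.upper(), target.upper()
    k = len(spacer)
    hits = []
    for i in range(len(target) - k + 1):
        mismatches = sum(a != b for a, b in zip(spacer, target[i:i + k]))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


def edge_style(mismatches):
    """Map a mismatch count to the edge style described in the Fig. 4c caption."""
    return "solid" if mismatches <= 1 else "dashed"
```

A hit with zero or one mismatch would be drawn as a solid edge and a hit with two or three mismatches as a dashed edge; anything above the budget is not reported.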

Some phage-encoded CRISPR loci have spacers that target bacteria in
the same sample or in a sample from the same study. We suppose that the
targeted bacteria are the hosts for these phages, an inference supported 
by other host prediction analyses (Supplementary Table 4). Some loci 
with bacterial chromosome-targeting spacers encode Cas proteins that 
could cleave the host chromosome, whereas others do not. The target- 
ing of host genes could disable or alter their regulation, which may be 
advantageous during the phage infection cycle. Some phage CRISPR 
spacers target bacterial intergenic regions, possibly interfering with 
genome regulation by blocking promoters or silencing non-coding RNAs. 

Notable examples of CRISPR targeting of bacterial chromosomes 
involve transcription and translation genes. For instance, one phage 
targets a σ70 transcription factor gene in the genome of its host and
encodes its own σ70 (Supplementary Information). Some huge phage
genomes encode anti-sigma factor-like proteins (AsiA), consistent
with previous reports of σ70 hijacking by phages with AsiA*’. In another
example, a phage spacer targets the host glycyl tRNA synthetase, but the
Cas14 effector lacks one of the required catalytic residues for cleavage,
suggesting a role in repression (as a ‘dCas14’), rather than in cleavage
(Supplementary Information). 

Notably, we found no evidence of host-encoded spacers that target 
any CRISPR-bearing phages. However, phage CRISPR targeting of other
phages that are also targeted by bacterial CRISPR (Fig. 4c) suggested
phage-host associations that were broadly confirmed by the phage 
taxonomic profile (Supplementary Table 4). 

Some large Pseudomonas-infecting phages encode anti-CRISPRs*” 
(Acrs) and proteins that assemble a nucleus-like compartment that 
segregates their replicating genomes from host-defence and other 
bacterial systems**. We identified proteins encoded in huge phage 
genomes that cluster with AcrVA5, AcrVA2, AcrIIA7 and AcrIIA11 and
may function as Acrs. We also identified tubulin homologues (PhuZ)
and proteins (Supplementary Information) that create a proteinaceous
phage ‘nucleus’. The phage nucleus was recently shown to protect the
phage genome against host defence by physically blocking degradation 
by CRISPR-Cas systems*. 


Conclusions 


We show that phages with huge genomes are widespread across 
Earth’s ecosystems. We manually completed 35 genomes, distinguish- 
ing them from prophages, providing accurate genome lengths and 
complete inventories of genes, including those encoded in complex 
repeat regions that break automated assemblies. Even closely related 
phages have diversified across habitats. Host and phage migration 
could transfer genes relevant to medicine and agriculture (for example, 
genes that affect pathogenicity and antibiotic resistance) (Supple- 
mentary Information). Additional mechanisms that are relevant to 
medical applications involve the direct or indirect activation of immune 
responses. For example, some phages directly stimulate IFNγ through
a TLR9-dependent pathway and exacerbate colitis**. Huge phages may 
represent a reservoir of novel nucleic acid manipulation tools with 
applications in genome editing and might be harnessed to improve 
human and animal health. For instance, huge phages equipped with 
CRISPR-Cas systems might be tamed and used to modulate the func- 
tions of the bacterial microbiome or eliminate unwanted bacteria. 

The huge phages comprise extensive clades, suggesting that a gene inventory
comparable in size to those of many symbiotic bacteria is a conserved
strategy for phage survival. Overall, their genes appear to redirect the protein
production capacity of the host to favour phage genes by first intercepting
the earliest steps of translation and subsequently ensuring the efficient
production of proteins. These inferences are aligned with findings for some 
eukaryotic viruses, which control every phase of protein synthesis”. Some 
phages acquired CRISPR-Cas systems with unusual compositions that may 
function to control host genes and eliminate competing phages. 

More broadly, huge phages represent little-known biology, the 
platforms for which are distinct from those of small phages and par- 
tially analogous to those of symbiotic bacteria, blurring the distinc- 
tions between life and non-life. Given phylogenetic evidence for large 
radiations of huge phages, we wonder whether they are ancient and 
arose simultaneously with free-living cells, their symbionts and other 
phages from a pre-life (protogenote) state** rather than appearing
more recently through episodes of genome expansion.


Methods


Phage- and plasmid-genome identification 

Datasets generated in the current study, those from previous research
conducted by our team, the Tara Oceans microbiomes” and the Global 
Oceans Virome* were searched for sequence assemblies that could 
have derived from phages with genomes of more than 200 kb in length. 
Read assembly, gene prediction and initial gene annotation followed 
standard, previously reported methods** “8. 

Phage candidates were initially found by retrieving sequences that 
were not assigned to a genome and had no clear taxonomic profile at 
the domain level. Taxonomic profiles were determined through a voting
scheme, in which the winning taxonomy had to have more than 50% 
votes for each taxonomic rank on the basis of protein annotations in 
the UniProt and ggKbase (https://ggkbase.berkeley.edu/) databases”. 
Phages were further narrowed down by identifying sequences with a 
high number of hypothetical protein annotations and/or the presence 
of phage-specific genes, such as capsid, tail, terminase, spike, holin, 
portal and baseplate. All candidate phage sequences were checked 
throughout to distinguish putative prophages from phages. Prophages 
were identified on the basis of a clear transition into the host genome 
with a high fraction of confident functional predictions, often associated
with core metabolic functions and much higher similarity to bacterial
genomes. Plasmids were distinguished from phages on the basis of
matches to plasmid partitioning and conjugative transfer genes. Those 
that did not have phage-specific genes were assigned using phyloge- 
netic tree placement using recA, polA, polB, dnaE and the DNA sliding 
clamp loader gene. Phages and placement assignments were further 
verified using a network of protein clustering with proteins from RefSeq 
prokaryotic viruses and 400 randomly sampled plasmids of more than 
200 kb using vContact2°° (Extended Data Fig. 6). 
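
The voting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the rank list, the function name and the choice to count every protein on the scaffold in the denominator (including proteins without a taxonomic assignment) are all hypothetical. Only the rule that a taxon needs more than 50% of the votes at each rank comes from the text.

```python
from collections import Counter

# Hypothetical rank order; the text only says "each taxonomic rank".
RANKS = ("domain", "phylum", "class", "order", "family", "genus")


def vote_taxonomy(protein_lineages, ranks=RANKS, threshold=0.5):
    """Return the consensus lineage of one scaffold.

    `protein_lineages` holds one dict per protein, mapping rank -> taxon
    (ranks may be missing). A rank is assigned only if a single taxon
    receives strictly more than `threshold` of all protein votes.
    """
    total = len(protein_lineages)
    consensus = {}
    for rank in ranks:
        counts = Counter(lin[rank] for lin in protein_lineages if lin.get(rank))
        if not counts:
            break
        taxon, votes = counts.most_common(1)[0]
        if votes / total <= threshold:
            break  # no majority here, so lower ranks stay unassigned
        consensus[rank] = taxon
    return consensus
```

Under these assumptions, a scaffold whose proteins mostly lack database matches returns an empty profile, which corresponds to the "no clear taxonomic profile at the domain level" condition used to shortlist phage candidates.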


Phage- and plasmid-genome manual curation 

All classified scaffolds were tested for end overlaps indicative of cir- 
cularization. Assembled sequences that could be perfectly circular- 
ized were considered potentially complete. Erroneous concatenated 
sequence assemblies were initially flagged by searching for direct 
repeats of more than 5 kb using Vmatch”. Potentially concatenated 
sequence assemblies were manually checked for multiple large repeat- 
ing sequences using the dotplot and RepeatFinder features in Geneious 
v.9. Sequences were corrected and removed from further analysis if the 
corrected length was more than 200 kb. 
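
The end-overlap test for circularization can be illustrated with a short sketch. It assumes that a circular genome assembled as a linear scaffold repeats its first bases exactly at its end; the function names and the `min_len` and `max_len` bounds are hypothetical choices for illustration, and the check for large internal direct repeats (performed with Vmatch in the text) is not reproduced here.

```python
def end_overlap(seq, min_len=20, max_len=1000):
    """Length of the longest exact overlap between the start and the end
    of `seq`, searched between `min_len` and `max_len`; 0 if none."""
    seq = seq.upper()
    upper = min(max_len, len(seq) // 2)
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def trim_circular(seq, **kwargs):
    """Remove the redundant terminal copy of a perfectly overlapping end."""
    k = end_overlap(seq, **kwargs)
    return seq[:-k] if k else seq
```

A scaffold with a non-zero overlap would be treated as potentially complete, and trimming the duplicated end gives the genome length without the repeated bases.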

A subset of the phage sequences was selected for manual curation, 
with the goal of finishing (replacing all Ns at scaffolding gaps or local 
misassemblies by the correct nucleotide sequences and circulariza- 
tion). Curation generally followed previously described methods’. In 
brief, reads from the appropriate dataset were mapped using Bowtie2 
v.2.3.4.1” to the de novo assembled sequences. Unplaced mate pairs 
of mapped reads were retained with shrinksam (https://github.com/ 
bcthomas/shrinksam). Mappings were manually checked throughout 
to identify local misassemblies using Geneious v.9. N-filled gaps or
misassembly corrections made use of unplaced paired reads, in some cases
using reads relocated from sites to which they were mismapped. In such 
cases, mismappings were identified on the basis of much larger than 
expected paired read distances, high polymorphism densities, back- 
wards mapping of one read pair or any combination of these. Similarly, 
ends were extended using unplaced or incorrectly placed paired reads 
until circularization could be established. In some cases, extended 
ends were used to recruit new scaffolds that were then added to the 
assembly. The accuracy of all extensions and local assembly changes
was verified in a subsequent phase of read mapping. In many cases,
assemblies were terminated or internally corrupted by the presence 
of repeated sequences. In these cases, blocks of repeated sequences 
as well as unique flanking sequences were identified. Reads were then 
manually relocated, respecting paired-read placement rules and unique
flanking sequences. After gap closure, circularization and verification
of accuracy throughout, end overlap was eliminated, genes were
predicted and the start moved to an intergenic region, which was, in
some cases, suspected to be the origin on the basis of a combination of
coverage trends and GC skew™. Finally, the sequences were checked
to identify any repeated sequences that could have led to an incorrect
path choice because the repeated regions were larger than the distance 
spanned by paired reads. This step also ruled out artefactual long phage 
sequences generated by end-to-end repeats of smaller phages, which 
occur in previously described datasets’. 
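
GC skew is used above as one line of evidence for the replication origin. A minimal sketch of a cumulative GC skew calculation is given below. The window size and function names are assumptions, coverage trends are not modelled, and the reading of the curve follows the usual bacterial convention (cumulative minimum near the origin, maximum near the terminus), which may not hold for every phage genome.

```python
def cumulative_gc_skew(seq, window=1000):
    """Cumulative sum of (G - C) / (G + C) over non-overlapping windows."""
    seq = seq.upper()
    total, curve = 0.0, []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        if g + c:
            total += (g - c) / (g + c)
        curve.append(total)
    return curve


def skew_extrema(seq, window=1000):
    """Return (minimum_position, maximum_position) of the cumulative curve,
    each reported as the end coordinate of the corresponding window."""
    curve = cumulative_gc_skew(seq, window)
    i_min = min(range(len(curve)), key=curve.__getitem__)
    i_max = max(range(len(curve)), key=curve.__getitem__)
    return (i_min + 1) * window, (i_max + 1) * window
```

A curve with one clear minimum and one clear maximum is the pattern expected for bidirectional replication, whereas a curve that rises or falls along most of the genome is more consistent with a unidirectional pattern.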


Structural and functional annotations 
After the identification and curation of phage genomes, coding 
sequences and Shine-Dalgarno ribosomal binding site motifs were 
predicted with Prodigal using genetic code 11 (-m -g 11 -p single). The
resulting coding sequences were annotated as previously described
by searching UniProt, UniRef100 and KEGG™. Functional annotations
were further assigned by searching proteins in PFAM r32, TIGRFAMS
r15°, Virus Orthologous Groups (VOG) r90 (http://vogdb.org/) and
Prokaryotic Virus Orthologous Groups” (pVOG). tRNAs were identified
with tRNAscan-SE v.2.0° using the bacterial model. tmRNAs were
assigned using ARAGORN v.1.2.38° with the genetic code of bacteria 
and plant chloroplasts. 

Clustering of the coding sequences into families was achieved using 
a two-step procedure. A first protein clustering was done using the 
fast and sensitive protein-sequence searching software MMseqs™. An 
all-versus-all sequence search was performed using an E-value cut-off
of 1x 10°, sensitivity of 7.5 and coverage of 0.5. A sequence similarity
network was built on the basis of the pairwise similarities and the greedy
set cover algorithm from MMseqs was performed to define protein
subclusters. The resulting subclusters were defined as subfamilies. To
test for distant homology, we grouped subfamilies into protein families
using a comparison of hidden Markov models (HMMs). The proteins of
each subfamily with at least two protein members were aligned using
the result2msa parameter of MMseqs, and HMM profiles were built
using the HHpred suite from the multiple sequence alignments. The
subfamilies were then compared to each other using HHblits from the
HHpred suite (with parameters -v 0 -p 50 -z 4 -Z 32000 -B 0 -b 0). For
subfamilies with probability scores of at least 95% and coverage at least 
0.50, a similarity score (probability x coverage) was used as weight of 
the input network in the final clustering using the Markov clustering 
algorithm”, with 2.0 as the inflation parameter. These clusters were 
defined as the protein families. Protein sequences were functionally 
annotated on the basis of their best hmmsearch match (v.3.1) (E-value 
cut-off 1 x 10°) against an HMM database constructed on the basis of 
orthologous groups defined by the KEGG database (downloaded on 
10 June 2015). Domains were predicted using the same hmmsearch 
procedure against the PFAM r31 database. The domain architecture 
of each protein sequence was predicted using the DAMA software 
(default parameters). SIGNALP (v.4.1) (parameters, -f short -t gram+) 
and PSORT v.3 (parameters, --long --positive) were used to predict 
the putative cellular localization of the proteins. Prediction of trans- 
membrane helices in proteins was performed using TMHMM (v.2.0) 
(default parameters). Hairpins (palindromes, based on identical over- 
lapping repeats in the forward and reverse directions) were identi- 
fied using the Geneious Repeat Finder and located across the dataset 
using Vmatch. Repeats of more than 25 bp with 100% similarity were 
tabulated. 
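
The edge-weighting step described above (keep subfamily pairs with a probability of at least 95% and coverage of at least 0.50, then weight each edge by probability × coverage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `HHblitsHit` record, its field names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HHblitsHit:
    """One subfamily-versus-subfamily comparison (hypothetical record layout)."""
    query: str
    target: str
    probability: float  # HHblits probability as a fraction (0.95 == 95%)
    coverage: float     # fraction of the profile covered by the alignment


def weighted_edges(hits, min_probability=0.95, min_coverage=0.50):
    """Return (query, target, weight) edges with weight = probability x coverage."""
    edges = []
    for hit in hits:
        if hit.query == hit.target:
            continue  # self-hits carry no clustering information
        if hit.probability >= min_probability and hit.coverage >= min_coverage:
            edges.append((hit.query, hit.target, hit.probability * hit.coverage))
    return edges


# Made-up example values, for illustration only.
hits = [
    HHblitsHit("subfam_1", "subfam_2", 0.99, 0.80),  # kept
    HHblitsHit("subfam_1", "subfam_3", 0.90, 0.90),  # dropped: probability below 95%
    HHblitsHit("subfam_2", "subfam_4", 0.97, 0.40),  # dropped: coverage below 0.50
]
for query, target, weight in weighted_edges(hits):
    print(f"{query}\t{target}\t{weight:.3f}")
```

A tab-separated edge list of this kind is the sort of input that Markov clustering implementations accept (for example, the label format read by the `mcl` program with `--abc`), and the inflation parameter of 2.0 would be set at that stage.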


Reference genomes for size comparisons 

RefSeq r92 genomes were recovered using the NCBI Virus portal and 
selecting only complete dsDNA genomes with bacterial hosts. Genomes 
from a previously published study were downloaded from IMG/VR and 
only sequence assemblies that were labelled ‘circular’ with predicted 
bacterial hosts were retained. Given the presence of sequences in IMG/VR 
that were based on erroneous concatenations, we only considered 
sequences from this source that were more than 200 kb; however, 
a subset of these was removed as artefactual sequences. 


Alternative genetic codes 

In cases in which the gene prediction using the standard bacterial code 
(code 11) resulted in seemingly anomalously low coding densities, 
potential alternative genetic codes were investigated. In addition to 
making a prediction using the fast and accurate genetic code inference 
and logo (FACIL) web server, we identified genes with well-defined 
functions (for example, polymerase or nuclease) and determined the 
stop codons terminating genes that were shorter than expected. We 
then repredicted genes using GLIMMER3 v.1.5 and Prodigal with TAG 
not interpreted as a stop codon. Other combinations of repurposed 
stop codons were evaluated and candidate codes (for example, code 
6, with only one stop codon) were ruled out owing to unlikely gene- 
fusion predictions. 
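
The reasoning behind testing an alternative code can be illustrated with a toy calculation: if TAG is really a sense codon, reading it as a stop fragments genes into short pieces, whereas recoding it restores one long open reading frame. The sketch below is a simplified illustration, not the gene predictors named above; the function name, the single-frame scan and the toy sequence are assumptions made for the example.

```python
def orf_lengths(seq, stop_codons, frame=0):
    """Lengths (in codons) of stop-free stretches in one reading frame."""
    lengths, current = [], 0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3].upper()
        if codon in stop_codons:
            if current:
                lengths.append(current)
            current = 0  # a stop codon ends the current stretch
        else:
            current += 1
    if current:
        lengths.append(current)
    return lengths


CODE_11_STOPS = {"TAA", "TAG", "TGA"}  # standard bacterial code
TAG_RECODED_STOPS = {"TAA", "TGA"}     # TAG read as a sense codon

# Toy sequence (made up): ATG GCT TAG GCA AAA TAG CCC TGA
toy = "ATGGCTTAGGCAAAATAGCCCTGA"
print(orf_lengths(toy, CODE_11_STOPS))      # fragmented at each TAG
print(orf_lengths(toy, TAG_RECODED_STOPS))  # one longer stretch ending at TGA
```

On real data, a marked rise in the typical stop-free stretch length (and hence in coding density) after recoding a stop codon is the kind of signal that would motivate repredicting genes with that codon treated as sense.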


Large terminase subunit and MCP phylogenetic analyses 

The phylogenetic tree of the large terminase subunit was constructed 
by recovering large terminases from the aforementioned protein-clus- 
tering and annotation pipeline. The coding sequences that matched 
with >30 bitscore to PFAM, TIGRFAMS, VOG and pVOG were retained. 
Any coding sequence that had a hit to large terminase, regardless of 
bitscore, was searched using HHblits against the uniclust30_2018_08 
database. The resulting alignment was then further searched against 
the PDB70 database. Remaining coding sequences that clustered 
in protein families with a large terminase HMM were also included 
after manual verification. Detected large terminases were manually 
verified using the HHPred and jPred webservers. Large terminases 
from the previously published >200-kb phage genomes and all >200-kb complete dsDNA 
phage genomes from RefSeq r92 were also included by protein fam- 
ily clustering with the phage-coding sequences from this study. The 
resulting terminases were clustered at 95% amino acid identity to 
reduce redundancy using CD-HIT. Smaller phage genomes were 
included by searching the resulting coding sequences set against 
the full RefSeq protein database and retaining the top 10 best hits. 
Those hits that had no large terminase match against PFAM, TIGR- 
FAMS, VOG or pVOG were removed from further consideration and 
the remaining set was clustered at 90% amino acid identity. The final 
set of large terminase coding sequences that were more than 100 
amino acids in length were aligned using MAFFT v.7.407 (--localpair 
--maxiterate 1000); poorly aligned sequences were removed and 
the resulting set was realigned. The phylogenetic tree was inferred 
using IQTREE v.1.6.6 with automatic model selection. The phylo- 
genetic tree of MCP genes was constructed by retrieving all MCPs 
annotated by combining the PFAM annotations of protein families 
and direct annotations by PFAM, TIGRFAMS, VOG and pVOG. Refer- 
ence MCP gene sequences were collected using the same strategy 
and sources as for the large terminase subunit tree. The resulting set 
was further screened by searching against PFAM, TIGRFAMS, VOG 
and pVOG and removing matches that had no large terminase match 
regardless of bitscore. The final set of MCP sequences were aligned 
with MAFFT (--localpair --maxiterate 1000) and the phylogenetic tree 
was constructed using IQTREE with automatic model selection and 
1,000 bootstrap replicates. 


Whole-genome scale clustering 

To identify phage genomes that were closely related at the whole- 
genome level, we compared sequences using whole-genome align- 
ments. The goal of this analysis was to further corroborate the identified 
phylogenetic clades and test for the presence of very similar phages 
in different habitats and environments. Genomes grouped together 
in the primary clusters from dRep v.2 were evaluated for genome 
alignment using Mauve within Geneious v.9. 


CRISPR-Cas locus and target detection 

Phage- and host-encoded CRISPR loci (repeats and spacers) were identi- 
fied using a combination of MinCED (https://github.com/ctSkennerton/minced) 
and CRISPRDetect. A custom HMM database of Cas genes was built by 
collecting Cas gene sequences from previous studies, aligning them 
with MAFFT (--localpair --maxiterate 1000) and building profiles with 
hmmbuild. The coding sequences from this study were searched against 
the HMM database using hmmsearch with E < 1 × 10°. Matches were checked 
using a combination of hmmscan and BLAST searches against the 
NCBI nr database and manually verified by identifying colocated 
CRISPR arrays and Cas genes. Spacers extracted from between repeats 
of the CRISPR locus were compared to sequence assemblies from 
the same site using BLASTN-short. Matches with alignment length 
>24 bp and <1 mismatch were retained and targets were classified as 
bacteria, phage or other. CRISPR arrays that had <1 mismatch were 
further searched for more spacer matches in the target sequence by 
finding more hits with <3 mismatches. 
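
The spacer-to-target filtering described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `SpacerHit` record and the example hits are hypothetical. The length cut-off is applied as strictly greater than 24 bp; the mismatch limit is left as a parameter and set here to at most one mismatch, which is an assumed reading of the cut-off quoted in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpacerHit:
    """One spacer-versus-assembly alignment (hypothetical record layout)."""
    spacer_id: str
    target_id: str
    alignment_length: int  # aligned length in bp
    mismatches: int


def filter_spacer_hits(hits, min_length_exclusive=24, max_mismatches=1):
    """Keep hits longer than the length cut-off with few enough mismatches."""
    return [
        hit for hit in hits
        if hit.alignment_length > min_length_exclusive
        and hit.mismatches <= max_mismatches  # assumed reading of the cut-off
    ]


def targets_by_spacer(hits):
    """Group hits as spacer_id -> sorted list of target ids."""
    grouped = defaultdict(set)
    for hit in hits:
        grouped[hit.spacer_id].add(hit.target_id)
    return {spacer: sorted(targets) for spacer, targets in grouped.items()}


# Made-up example hits, for illustration only.
example_hits = [
    SpacerHit("array1_spacer3", "phage_contig_A", 30, 0),    # kept
    SpacerHit("array1_spacer3", "phage_contig_B", 24, 0),    # dropped: not longer than 24 bp
    SpacerHit("array2_spacer1", "phage_contig_A", 28, 2),    # dropped: too many mismatches
    SpacerHit("array2_spacer7", "plasmid_contig_C", 32, 1),  # kept
]
retained = filter_spacer_hits(example_hits)
print(targets_by_spacer(retained))
```

A second, more permissive pass over arrays that already have a retained hit could reuse `filter_spacer_hits` with a larger `max_mismatches` value.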


Host identification 

The phylum affiliations of bacterial hosts for phage and plasmid-like 
sequences were predicted by considering the UniProt taxonomic pro- 
files of every coding sequence for each phage genome. The phylum 
level matches for each phage genome were summed and the phylum 
with the most hits was considered as the potential host phylum. How- 
ever, only cases in which this phylum had 3× as many counts as 
the next most-counted phylum were assigned as the tentative phage 
host phylum. Phage hosts were further assigned and verified using 
the CRISPR-targeting strategy described above with the phage and 
plasmid-like genomes as targets. CRISPR arrays were predicted on all 
sequence assemblies from the same site from which each phage genome was 
reconstructed. Sequence assemblies containing spacers with a match 
of length >24 bp and <1 mismatch were used to infer phage-host rela- 
tionships. In all cases, the predicted host phylum based on taxonomic 
profiling and CRISPR targeting were in complete agreement. Similarly, 
the phyla of hosts were predicted on the basis of phylogenetic analysis 
of phage genes also found in host genomes (for example, involved in 
translation and nucleotide reactions). Inferences based on computed 
taxonomic profiles and phylogenetic trees were also in complete agree- 
ment. 
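
The taxonomic-profile vote described above reduces to a simple counting rule, sketched here in Python. The function name and example counts are hypothetical, and two details are assumptions made for the illustration: '3× as many counts' is read as 'at least three times', and a genome whose hits all fall in a single phylum is assigned to that phylum.

```python
from collections import Counter


def assign_host_phylum(phylum_hits, min_ratio=3.0):
    """Return the tentative host phylum, or None if the vote is not decisive.

    phylum_hits: one phylum label per coding sequence with a taxonomic match.
    """
    counts = Counter(phylum_hits)
    if not counts:
        return None  # no taxonomic information at all
    ranked = counts.most_common(2)
    top_phylum, top_count = ranked[0]
    if len(ranked) == 1:
        return top_phylum  # assumption: no competing phylum means a clear vote
    runner_up_count = ranked[1][1]
    # Assumption: '3x as many' is interpreted as 'at least 3x as many'.
    if top_count >= min_ratio * runner_up_count:
        return top_phylum
    return None


# Made-up example profiles, for illustration only.
clear_vote = ["Firmicutes"] * 9 + ["Bacteroidetes"] * 3
ambiguous_vote = ["Firmicutes"] * 5 + ["Proteobacteria"] * 3
print(assign_host_phylum(clear_vote))      # Firmicutes
print(assign_host_phylum(ambiguous_vote))  # None
```

Returning `None` for ambiguous profiles keeps such genomes unassigned, so that an independent line of evidence (such as CRISPR targeting) can be used instead.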


Phage-encoded tRNA synthetase trees 

Phylogenetic trees were constructed for phage-encoded tRNA syn- 
thetase, ribosomal and initiation factor protein sequences using a set 
of the closest reference sequences from NCBI and bacterial genomes 
from the current study. The tRNA synthetases were identified on the 
basis of annotation of genes using the standard ggKbase pipeline (see 
above), and confirmed by HMMs with datasets from TIGRFAMS. For 
each type of tRNA synthetase, references were selected by comparing 
all of the corresponding genes of this type against the NCBI nr data- 
base using DIAMOND v.0.9.24; their top 100 hits were clustered by 
CD-HIT using a 90% similarity threshold. The phylogenetic tree of 
each tRNA synthetase was constructed using RAxML v.8.0.26 with 
the PROTGAMMALG model. 


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


GenBank files for all genomes are provided as Supplementary Informa- 
tion. Sequence reads and genomes have been deposited at the European 
Nucleotide Archive (ENA) under project PRJEB35371. Genomes have 
been deposited at ENA under accessions ERS4026114-ERS4026474. 
Reads are available at ENA under accessions ERS4025670-ERS4025731. 
Read accessions and genome accessions for each phage genome are 
included in Supplementary Table 1. 


Code availability 


The custom code used to analyse the genomes is available at 
http://www.github.com/rohansachdeva/assembly_repeats. 
Extended Data Fig. 2 | Ecosystems with phage genomes and plasmid-like 
sequences of more than 200 kb. Genomes grouped by sampling-site type. 
Each box represents a phage genome or plasmid-like sequence, and boxes are 
horizontally arranged in order of decreasing genome size. The size range for 
Extended Data Fig. 3 | Examples of phage genomes that display GC skew 
indicative of bidirectional replication. a, b, Example phage genomes with GC 
skew patterns that are strongly indicative of bidirectional replication (origin- 
to-terminus) that is typically found in bacteria (however, the origin may not 
correspond to the start of the genome). c, d, Phage genomes with GC skew 
patterns that are suggestive of unidirectional patterns. 
Extended Data Fig. 4 | Example of the alternative coding of phages. 
Comparisons of gene predictions for a region with genes of clearly predicted 
function in MOS_PHAGE_COMPLETE_32_3. a, The standard (code 11) genetic 
code. b, Both TAG and TAA repurposed (code 6). c, TAG repurposed (code 16). 
Overall, analysis of well-annotated genes supported code 16 as the best choice 
(TAG to X, as X could not be clearly resolved on the basis of sequence 
alignments with related proteins). 
Extended Data Fig. 5 | Phylogenetic and protein-cluster relationships 
between phages. a, The phylogenetic tree of phages based on the MCPs. The 
outer ring shows genome length; bars in red indicate genomes reconstructed 
and reported in this study and bars in blue indicate database genomes. The 
next ring indicates the environment of origin. The inner ring indicates the 
phylum of the host (black indicates unknown). Superimposed colours indicate 
named clades that consist of huge phages that were identified in the terminase 
tree. Colours are as in Fig. 2. b, Hierarchical clustering dendrogram of phage 
genomes based on the Jaccard distance between the presence or absence 
profiles of protein families, performed using an average linkage method. The 
outermost ring shows phage genome length, the next ring shows the 
environment of origin, then predicted phylum affiliation of bacterial hosts. 
Superimposed colours indicate named clades that consist of huge phages that 
were identified in the terminase tree. Colours are as in Fig. 2. The clustering 
supports the phylogenetic analyses shown in a and Fig. 2. 
Extended Data Fig. 6 | Protein-clustering network for phages and plasmids. 
Network analysis using vContact2 and Cytoscape based on the number of 
shared protein clusters between the genomes in this study, RefSeq prokaryotic 
virus genomes and 400 randomly sampled plasmid sequences from RefSeq. 
Each node represents a genome and each edge is the hypergeometric similarity 
(>30) between genomes based on shared protein clusters. This analysis was 
used to help to distinguish between the classification of genomes as phage, 
plasmid or unknown. 
Extended Data Fig. 7 | Phylogenetic analysis of tRNA synthetase. 
a, Aminoacyl tRNA synthetases were detected in many huge phages reported 
in this study (Supplementary Table 6). The phylogenetic subtree for glutamate- 
tRNA synthetase sequences from phages (red text and small triangles) that 
place within or close to sequences from Bacteroidetes hosts is shown as an 
example. Bacterial sequences from public databases are indicated by black text 
and those from metagenomes from which huge phage genomes were 
reconstructed are indicated by blue text. Coloured circles indicate the 
predicted phylum of the bacterial host for each phage. b, Phylogenetic tree of 
phage-encoded ribosomal protein S21 and the top RefSeq hits for each protein, 
constructed using IQTREE. Sequences from this study are indicated by red 
branches. 
Extended Data Fig. 8 | Phylogenetic trees of Cas14, CRISPR-Cas type V-U and Cas9. 
a, Phylogenetic tree for Cas14 and type V-U. b, Phylogenetic tree for Cas9. 
Sequences from this study are indicated by red branches. 


Extended Data Fig. 9 | Variant type I CRISPR-Cas system and Cas4-like 
proteins found in the genomes of huge phages. a, Locus architecture for 
type-I variant CRISPR phages. An interesting type-I system identified in huge 
phages lacks Cas6 but has Cas5, which is most similar to the Cas5d protein from 
type I-C, in which Cas5d acts as the pre-crRNA endonuclease (a role commonly 
reserved for Cas6). The proposed active site residues of Cas5d are to some 
extent different in the Cas5 of this system, although this may still confer 
processing activity, as this change is also observed in other Cas6 homologues. 
are indicated by red branches. c, Phylogenetic tree of Cas4, Cas4-like proteins 
Our current knowledge about nucleocytoplasmic large DNA viruses (NCLDVs) is 
largely derived from viral isolates that are co-cultivated with protists and algae. Here 
we reconstructed 2,074 NCLDV genomes from sampling sites across the globe by 
building on the rapidly increasing amount of publicly available metagenome data. 
This led to an 11-fold increase in phylogenetic diversity and a parallel 10-fold 
expansion in functional diversity. Analysis of 58,023 major capsid proteins from large 
and giant viruses using metagenomic data revealed the global distribution patterns 
and cosmopolitan nature of these viruses. The discovered viral genomes encoded a 
wide range of proteins with putative roles in photosynthesis and diverse substrate 
transport processes, indicating that host reprogramming is probably a common 
strategy in the NCLDVs. Furthermore, inferences of horizontal gene transfer 
connected viral lineages to diverse eukaryotic hosts. We anticipate that the global 
diversity of NCLDVs that we describe here will establish giant viruses, which are 
associated with most major eukaryotic lineages, as important players in ecosystems 
across Earth’s biomes. 


Large and giant viruses of the NCLDV supergroup have complex 
genomes with sizes of up to several megabases, and virions that 
are a similar size to, or even larger than, small cellular organisms. 
These viruses infect a wide range of eukaryotes from protists to ani- 
mals. Marker gene surveys have shown that NCLDVs are not only 
extremely abundant and diverse in oceans, but can also frequently be 
found in freshwater and soil. However, the discovery of large and giant 
viruses has mainly been driven by their co-cultivation with amoebae or 
isolation together with their native hosts. Only recently, metagen- 
omic and single-cell genomic studies have facilitated the discovery of 
several new NCLDV members and showed that cultivation-independent 
methods are applicable to these viruses just as they are to uncultivated 
Bacteria and Archaea. 

Here, we have used a multistep metagenome data-mining, binning 
and iterative-filtering pipeline (Extended Data Figs. 1, 2 and Supple- 
mentary Text 1), which led to the recovery of genomes representing 
2,074 putative NCLDV populations from 8,535 publicly available 
metagenomes in the Integrated Microbial Genomes and Microbi- 
omes (IMG/M) database. The assembly size, GC content, coding 
density and copy number of nucleocytoplasmic virus orthologous 
genes (NCVOGs) were comparable to previously described NCLDV 
genomes, supporting the classification of these genomes as giant virus 
metagenome-assembled genomes (GVMAGs) (Extended Data 
Figs. 3, 4 and Supplementary Tables 1-3). Using an approach that 
relied on conserved NCVOGs, we estimated genome completeness 
and contamination, which led to the classification of 773 high-qual- 
ity, 989 medium-quality and 312 low-quality GVMAGs (Extended 
Data Figs. 1, 4 and Supplementary Tables 1, 4), in line with the MIUViG 
recommendations. 

Augmenting the existing NCLDV phylogenetic framework with the 
GVMAGs substantially increased the diversity of this proposed viral 
order (Fig. 1a and Supplementary Data 1). The resulting phylogenetic 
tree expanded from 205 to 2,279 viral genomes, which can now be 
divided into 100 potentially genus- or subfamily-level monophyletic 
clades spanning 10 provisional superclades, compared with the previ- 
ously recognized 20 genera. This translates into an 11-fold increase in 
phylogenetic diversity of the NCLDVs. Notably, the addition of the novel 
viral genomes did not change the basic topology of the NCLDV tree but 
rather altered the contribution of existing groups, the Mimiviridae 
in particular, to the total viral diversity. Furthermore, the presence 
of conserved NCVOGs in lineage-specific patterns strengthens the 
hypothesis of a common evolutionary origin of this viral group. Novel 
groups of viruses with no isolate representatives appeared within the 
existing taxonomic framework (that is, metagenomic giant virus line- 
ages (MGVLs)). The greatest number of GVMAGs could be attributed to 
MGVL57 (n = 205), the Yellowstone Lake mimiviruses (YLMVs; n = 119) 
and MGVL42 (n = 84). In addition, several established viral lineages 
were considerably extended, such as the prasinoviruses (n = 77), iri- 
doviruses (n = 59), cafeteriaviruses (n = 43), phaeocystisviruses (n = 37), 
klosneuviruses (n = 36), tetraselmisviruses (n = 34) and raphidoviruses 
(n = 26), some of which previously consisted of single isolates. In total, 
the GVMAGs increased the 123,000 previously known NCLDV proteins 
that clustered in 47,700 protein families to more than 924,000 proteins 
in 508,000 protein families (Extended Data Fig. 5a). Pfam-A protein 
Fig. 1 | Metagenomic expansion of the NCLDV diversity. a, Maximum- 
likelihood phylogenetic tree of the NCLDV inferred from a concatenated 
protein alignment of five core NCVOGs. Branches in dark red represent 
published genomes and branches in black represent GVMAGs generated in this 
study. Shades of grey indicate boundaries of genus- and subfamily-level clades; 
previously described lineages are labelled. Tree annotations from inside 
to the outside: (1) superclade (SC), (2) GC content, (3) assembly size and (4) 
environmental origin. b, Distribution of NCLDV lineages across different 
habitats. The bars adjacent to the heat map show the total number of detected 

domains could be assigned to less than one third (31%) of these proteins 
(Extended Data Fig. 5b). On the basis of known gene functions, the 
potentially most-versatile viral lineage was the klosneuviruses, for which 
more than 1,200 different protein domains could be detected (Extended 
Data Fig. 5b). MGVL57, MGVL58, YLMVs and klosneuviruses were the 
most-diverse lineages on the basis of their overall gene content, as 


[Fig. 1 graphic. Legend for environmental origin: algae, bioremediation, freshwater, marine, non-marine saline, plants, sediment, terrestrial, thermal springs, wastewater. Scale bar, 1 substitution per site. Heat-map bars: total assembly size (Gb) and number of MCPs detected per assembled Gb.] 

MCPs per habitat (facing to the right) and per lineage (facing downwards) as 
total count (total bar length) and corrected count on the basis of the average 
copy number of MCPs in the respective lineage (darker shaded bar length). 
The plot includes only lineages for which at least 100 MCPs could be detected. 
NCLDV lineages with available virus isolates are indicated in red. The turquoise 
dashed line indicates the total size of the metagenome assemblies that were 
indicated by a low number of shared protein families compared with 
the total number of protein families (Extended Data Fig. 5c). MGVL27, 
medusaviruses, sylvanviruses and MGVL24 represented the viral line- 
ages with the highest genome novelty; for these lineages, on average, 
less than 15% of proteins showed similarity to known NCLDV proteins 
(Extended Data Fig. 6). Notably, clades that had been predominantly 
sampled in the past with several viral isolate genomes sequenced, 
such as marseilleviruses, poxviruses, pandoraviruses and faustoviruses, 
were nearly absent in the environmental microbiome data. This find- 
ing indicates that these viruses or their hosts have comparably low 
abundances in the samples analysed in our dataset. It also suggests that 
there is a skew in the isolation and co-cultivation efforts of giant viruses 
using selected non-native hosts in laboratory setups. Large-scale, 
cultivation-independent genome-resolved metagenomics alleviates 
such bias and provides a more-global snapshot of diversity and the 
spatial distribution of NCLDVs in their natural habitats. 

To further deepen our understanding of the environmental distribu- 
tion patterns of the NCLDVs, we performed a survey of the major capsid 
protein (MCP) across all public metagenomic datasets. We identified 
more than 58,000 copies of this protein, of which 67% could be assigned 
to viral lineages (Fig. 1b). Among the most-commonly found lineages 
were prasinoviruses, MGVL57 and YLMV with more than 1,000 occur- 
rences each. At the same time, only a few MCPs (fewer than 100) were 
detected in viruses that have repeatedly been isolated in co-cultivation 
with amoebae, such as megamimiviruses, marseilleviruses and faus- 
toviruses. In our environmental survey, MCPs were predominantly 
found in marine (around 55%) and freshwater (about 40%) and, to a 
much lesser extent, in terrestrial (less than 1%) environments. Some 
NCLDV lineages occurred solely in either freshwater (YLMV, MGVL33 
and MGVL36) or marine (prasinoviruses, MGVL42 and MGVL66) sys- 
tems, whereas members of other lineages were found in both, or in 
an even wider range of, environments (such as klosneuviruses, which 
were found in freshwater, marine, non-marine saline, terrestrial, waste- 
water and host-associated ecosystems). Large and giant viruses could 
also be detected in hydrothermal vents and thermal springs; however, 
comparably few MCPs were present in these habitats (Fig. 1b). Project- 
ing the distribution of NCLDVs onto a global scale makes their ubiqui- 
tous nature apparent (Extended Data Fig. 7). These viruses can be found 
almost anywhere with many different lineages often co-occurring in 
close proximity to each other, suggesting that their discovery is chiefly 
limited by sampling effort. 

Considering the ubiquitous prevalence of large and giant viruses, we 
aimed to investigate the potential influences that these viruses have on 
their hosts. The detrimental effects of viral infections on their eukary- 
otic hosts are well known; however, a few recent studies have shown 
that NCLDVs might also complement the metabolism of their host, 
for example, by encoding transporters that take up nutrients, such as 
nitrogen, or fermentation genes. Expanding these initial findings, 
eukaryotes or bacteria; circles indicate vertical transmission after ancient HGT 
or gene birth in the NCLDV; a darker colour indicates the predominantly 
observed mode of transmission (five or more events). The stacked bars on the 
right side of the heat map show, for each observed protein domain, the 
proportional distribution across different habitat types. Bars on the far right 
indicate the total number of observations for each protein domain. 


our data showed that diverse lineages across all NCLDV superclades 
encoded enzymes with potential roles in photosynthesis, diverse 
substrate transport processes, light-driven proton pumps and retinal 
pigments (Fig. 2). Maps of the presence, absence and prevalence of 
these genes revealed lineage- and environment-specific patterns. 
Most commonly observed across a wide range of habitats were ABC 
transporters, chlorophyll a/b-binding proteins and bacteriorhodop- 
sin-like proteins (Fig. 2, Supplementary Note 2 and Supplementary 
Table 5). Transporters for ammonium, magnesium and phosphate, 
which are likely to be of importance for hosts in oligotrophic envi- 
ronments such as the surface ocean, were predominantly found in 
marine viruses. Enzymes such as ferric reductases and multicopper 
oxidases, which facilitate the uptake of iron (an essential trace 
element that is often growth-limiting, especially in photosynthetic 
organisms), were encoded in GVMAGs sampled across different 
habitats. This wealth of virus-encoded genes with roles in energy 
generation and nutrient acquisition has far-reaching implications 
for ecosystem dynamics. Metabolic reprogramming refers to a com- 
mon phenomenon in which bacterial viruses obtain genes from their 
hosts and maintain them to support host metabolism. Our results 
illustrate that in a similar manner, NCLDV-mediated host reprogram- 
ming is probably an important strategy to increase viral fecundity 
and at the same time confer a short-term competitive advantage 
on infected eukaryotic host cells, especially under nutrient-limited 
conditions. 

In agreement with previous studies, many of the identified 
viral genes with predicted effects on host cell processes were prob- 
ably acquired from their hosts through horizontal gene transfer (HGT) 
(Fig. 2 and Extended Data Fig. 8). Other genes were present across 
different viral lineages and superclades, suggesting ancient transfer 
followed by vertical inheritance during the course of NCLDV evolu- 
tion or the origin of the respective gene in a common ancestor of this 
group of viruses. A notable example is the group of rhodopsin-like 
domain-containing proteins, which we found in 555 of the GVMAGs. 
Type-1 rhodopsins in algae-infecting phycodnaviruses and in viruses of 
heterotrophic choanoflagellates have been reported in previous studies 
and comprise viral rhodopsin groups I and II. However, in light of 
our extended sampling of NCLDV genomes, it becomes evident that 
NCLDVs encoded more-diverse rhodopsins than described (Extended 
Data Fig. 8), which comprise approximately one quarter of the total 
known diversity of rhodopsins and include proteins from all publicly 
available metagenomes (Extended Data Fig. 9). Notably, the phylogeny 
Fig. 3 | HGT between NCLDV and their putative eukaryotic hosts. Undirected 
HGT network with nodes that represent previously described viral lineages and 
MGVLs, coloured on the basis of NCLDV superclade affiliation, with names above 
the node and their putative hosts (highlighted in black with names below the 
node, coloured on the basis of lifestyle); edges are weighted on the basis of the 
number of detected transfers. Connections comprising at least four transfers 
are shown. Experimentally verified virus-host associations are highlighted in 
yellow with names in bold. The proportion of HGT candidates assigned to hosts 
from different major eukaryotic lineages is shown as a pie chart. 


of the viral rhodopsins from all NCLDV superclades exhibits a strongly 
supported monophyletic signal, which implies that this gene might 
represent an ancestral trait of the NCLDV that was subsequently lost in 
some lineages. In addition to viral rhodopsin groups I and II, additional 
NCLDV rhodopsins branch closely to their cellular counterparts and 
have probably been acquired by HGT from different hosts (Extended 
Data Fig. 8). In a similar manner, putative NCLDV heliorhodopsins 
were found intertwined with their homologues in the algae Chrysoch- 
romulina and Micromonas (Extended Data Fig. 8). In addition to the rho- 
dopsins, our dataset contained 119 GVMAGs that encoded carotenoid 
oxygenases, which potentially modulate light-harvesting capacity or 
synthesize bioactive compounds. It is conceivable that some of the 
NCLDV rhodopsins function in conjunction with the carotenoid oxy- 
genases and have important roles in modulating host-cell processes, 
for example, by acting as light-driven proton pumps, as photorecep- 
tors in host phototactic motility or as photoprotectants; each of 
these functions leads to metabolic advantages for infected populations. 

Uptake of host genes is a common mechanism in the evolution of 
NCLDVs. Using HGT analyses, we assigned putative hosts to dif- 
ferent NCLDV lineages. Analysis of 2,040 genes that have probably 
undergone HGT provided linkage information for 50 viral lineages 
to 32 groups of putative eukaryotic hosts (Fig. 3 and Supplementary 
Table 6). Notably, 17 out of 23 viral lineages that contained genomes 
from isolated viruses could be connected through HGT to their experi- 
mentally verified native hosts, such as most algae-infecting viruses 
and metazoa-infecting ascoviruses, namaoviruses and poxviruses, as 
well as connecting klosneuviruses to Kinetoplastida. Our analysis 
further confirmed Acanthamoeba as a host of pandoraviruses, pitho- 
cedratviruses, medusaviruses, marseilleviruses and megamimiviruses. 
Notably, megamimiviruses, which have exclusively been obtained 
through co-cultivation with amoebae, showed not only HGT with this 
host but were linked even more strongly to multicellular animals. The 
best-connected NCLDV lineage was the klosneuviruses, a viral subfamily 
mainly known from metagenomic studies. Our HGT network 
revealed that klosneuviruses have a diverse putative host range of 
mainly heterotrophs, including Anthoathecata (to which they showed 
the strongest connection) as well as fungi and arthropods, and differ- 
ent protists, including slime moulds. By contrast, Oomycetes, Dikarya, 
fungi incertae sedis and Streptophytina emerged as putative hosts 
for the greatest number of different NCLDV lineages, despite the lack 
of isolation of NCLDVs from any of these organisms. With predicted 
hosts in Opisthokonta, Amoebozoa, Excavata, Archaeoplastida, Cryp- 
tista and the Stramenopila, Alveolata, Rhizaria (SAR) supergroup, our 
results suggest that members of the NCLDV might be able to infect 
most major eukaryotic lineages (Fig. 3). This is consistent with pre- 
vious reports based on eukaryotic genome data and experimental 
data showing that large and giant viruses infect marine arrow worms, 
epithelial cells in fish gills and potentially also corals and sponges. 
Of note, our analysis did not reveal linkage to human hosts. We expect 
that with improved sampling of host genomes, particularly genomes 
of underexplored protists and algae, host linkage through HGT will 
yield an even more comprehensive picture of the host range and 
evolutionary histories of NCLDVs. 

Overall, we leveraged the availability of metagenomic data generated 
by the global sampling efforts of a community of scientists to expand 
our insights into the diversity, host metabolic complementation and 
putative host range of large and giant viruses. NCLDV infections prob- 
ably occur in all major eukaryotic lineages, with repercussions for many 
of Earth’s major biogeochemical processes. Our data and findings rep- 
resent a solid foundation and expansive resource for future giant-virus 
research efforts to deepen our understanding of the evolutionary and 
ecological bearings of these viral giants. 
Methods 


Generation of models to detect NCLDV proteins 

Initial hidden Markov models (HMMs) for the MCPs were built from 
a multiple sequence alignment of published NCLDV MCPs and sub- 
sequently updated on the basis of extracted metagenomic NCLDV 
MCP sequences. We screened around 537 million proteins encoded 
on about 45.1 million contigs with a length greater than 5 kb avail- 
able in 8,535 public metagenomes in IMG/M*® (June 2018) for contigs 
that encode the NCLDV MCP using a version of hmmsearch (v.3.1b2, 
http://hmmer.org/) that is optimized“ for the supercomputer Cori, 
with a set of models for the NCLDV MCP (https://bitbucket.org/berkeleylab/mtg-gv-exp/) 
and an E-value cut-off of 1x10. The 1,003,222 proteins 
found on the 77,701 contigs with hits for MCPs were then clustered 
with CD-hit* at a sequence similarity of 99% to remove nearly identical 
and identical proteins. This resulted in 524,161 clusters and singletons. 
The cluster representatives were used to infer protein families using 
orthofinder (v.2.27) with default settings and the -diamond flag***”. 
Multiple sequence alignments were built with mafft*s (v.7.294b) for 
protein families that included at least 10 members and corresponding 
HMMs were obtained with hmmbuild (v.3.1b2, http://hmmer.org/). 
This led to a total of 7,182 HMMs that can detect NCLDV proteins, 
which were then tested against all public genomes in IMG/M* (June 
2018). Models that gave rise to hits above an E-value cut-off of 1x 10° in 
more than 10 reference genomes were removed. The resulting 5,064 
models were then used for targeted binning of NCLDV metagenome 
contigs. 
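The specificity filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input layout (a mapping from model name to the reference genomes in which it produced significant hits) and the toy data are assumptions.

```python
def filter_specific_models(hits_per_model, max_genomes=10):
    """Keep models with significant hits in at most `max_genomes` reference genomes.

    hits_per_model: dict mapping a model name to an iterable of reference
    genome identifiers with at least one hit passing the E-value cut-off
    (hypothetical layout; the cut-off itself is applied upstream).
    """
    kept = []
    for model, genomes in hits_per_model.items():
        # Count distinct genomes, so repeated hits in one genome count once.
        if len(set(genomes)) <= max_genomes:
            kept.append(model)
    return sorted(kept)


# Toy example: one broadly hitting model is discarded.
example_hits = {
    "model_A": ["g1", "g2"],
    "model_B": ["g%d" % i for i in range(11)],  # hits in 11 genomes
    "model_C": [],
}
specific = filter_specific_models(example_hits)
```

Models that never hit a reference genome are retained, which matches the stated rule of removing only models with hits in more than 10 reference genomes.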


Identification of NCLDV-specific genome features and design of 
an automatic classifier 

A set of representative genomes of bacteria, archaea, eukaryotes 
and non-NCLDV viruses was gathered from the IMG/M database** 
(June 2018) and combined with NCLDV genomes assembled from 
metagenomes and protist genomes downloaded from NCBI GenBank 
to identify NCLDV-specific genome features. Genes were predicted 
for these genomes using Prodigal’ (v.2.6.3; February, 2016) in both 
‘regular’ mode (default parameters) and with the option ‘-n’ activated, 
which forces a full motif scan. For genomes of less than 100 kb, the 
option ‘-p meta’ was used to apply precalculated training files rather 
than training the gene predictor from the genome, as recommended by 
the tool documentation. Next, a set of different metrics was calculated 
for each genome on the basis of the genes predicted with a confidence 
of >90 and score of 50. These included gene density (number of genes 
predicted on average per 10 kb of genome), coding density (number of 
bp predicted as part of a coding sequence per 10 kb of genome), spacer 
length (average length of the spacer between the predicted ribosomal 
binding site (RBS) and the predicted start codon, for genes in which a putative 
RBS was detected) and RBS motif profile (the proportion of each type 
of RBS predicted in the genome, see below). 
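As a hedged illustration of the density metrics defined above, the sketch below computes gene density and coding density per 10 kb from gene coordinates. The function name and the coordinate convention (1-based, inclusive) are assumptions; overlapping genes are counted twice in this simplified version.

```python
def genome_density_metrics(genome_length_bp, genes):
    """Return gene density and coding density, both per 10 kb of genome.

    genes: list of (start, end) tuples, 1-based inclusive coordinates
    (assumed convention). Overlapping genes are not merged here.
    """
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    units_of_10kb = genome_length_bp / 10_000
    coding_bp = sum(end - start + 1 for start, end in genes)
    return {
        "gene_density": len(genes) / units_of_10kb,
        "coding_density": coding_bp / units_of_10kb,
    }


# Toy example: two genes (900 bp and 1,500 bp) on a 20-kb sequence.
metrics = genome_density_metrics(20_000, [(1, 900), (1001, 2500)])
```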

For the RBS motif profile, motifs were predicted using the full motif 
scan option of prodigal (see above). Notably, some of these motifs 
may not represent true RBSs, but are instead other conserved motifs 
(including transcription-related motifs) found upstream of start 
codons in these different genomes. These motifs were grouped into 
11 categories as follows: (1) ‘None’ for cases in which prodigal did not 
predict an RBS; (2) ‘SD_Canonical’ for different variations of the canonical 
AGGAGG Shine-Dalgarno sequence (for example, AGGAG, AGxAG, 
GAGGA, as well as motifs identified by Prodigal as ‘3Base_5BMM’ or 
‘4Base_6BMM’); (3) ‘SD_Bacteroidetes’ for variations of the motif predicted 
typically from Bacteroidetes genomes (TA{2,5}T{0,1}: T followed 
by 2-5 As, and with sometimes a terminal T); (4) ‘Other_GA’ for motifs 
that include ‘GA’ patterns but that are different from the canonical 
Shine-Dalgarno sequence, for example, GAGGGA, typically identified 
in a few archaeal and bacterial genomes; (5) ‘TATATA_3.6’ for variations 
of the motif typically detected in NCLDV, that is, a motif of 3-6 bp with 
alternating Ts and As (TAT, ATAT, TATA, TATAT, and so on); (6) ‘OnlyA’ for 
motifs exclusively composed of As not already included in a previous 
group, for example, AAAAA, most often found in Bacteroidetes; (7) 
‘OnlyT’ for motifs exclusively composed of Ts not already included in a 
previous group, for example, TTTTT, found at a low frequency in some 
archaeal genomes; (8) ‘DoubleA’ for motifs with two consecutive As not 
already included in a previous group, for example, AAAAC, most often 
found in Bacteroidetes and bacteria from the candidate phyla radiation 
(CPR) group; (9) ‘DoubleT’ for motifs with two consecutive Ts not 
already included in a previous group, for example, TACTT, found at a 
low frequency in plants, Bacteroidetes and NCLDV; (10) ‘NoA’ for motifs 
without any As and not included in a previous group, for example, 
TCTCG, found in some archaeal genomes; and (11) ‘Other’ for motifs 
that did not fit into any of these categories. 
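The 11 categories are applied in order, each one catching only motifs not already assigned to an earlier group. A minimal sketch of such an ordered classifier is given below. It is an illustration under assumptions, not the authors' implementation: the set of canonical Shine-Dalgarno variants contains only a few examples (those named in the text plus GGAGG), and the function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative subset of canonical Shine-Dalgarno variants (assumption:
# examples from the text plus GGAGG, not the authors' full list).
SD_CANONICAL = {"AGGAGG", "AGGAG", "GGAGG", "AGXAG", "GAGGA",
                "3BASE_5BMM", "4BASE_6BMM"}
# T followed by 2-5 As, with an optional terminal T.
SD_BACTEROIDETES = re.compile(r"TA{2,5}T?")


def is_alternating_ta(motif):
    """True for 3-6 bp motifs made of strictly alternating Ts and As."""
    return (3 <= len(motif) <= 6
            and set(motif) <= {"T", "A"}
            and all(a != b for a, b in zip(motif, motif[1:])))


def classify_rbs_motif(motif):
    """Assign a motif to one of the 11 categories, checked in order."""
    if motif is None or motif.strip().lower() in ("", "none"):
        return "None"
    m = motif.strip().upper()
    if m in SD_CANONICAL:
        return "SD_Canonical"
    if SD_BACTEROIDETES.fullmatch(m):
        return "SD_Bacteroidetes"
    if "GA" in m:
        return "Other_GA"
    if is_alternating_ta(m):
        return "TATATA_3.6"
    if set(m) == {"A"}:
        return "OnlyA"
    if set(m) == {"T"}:
        return "OnlyT"
    if "AA" in m:
        return "DoubleA"
    if "TT" in m:
        return "DoubleT"
    if "A" not in m:
        return "NoA"
    return "Other"


def motif_profile(motifs):
    """Proportion of each motif category among the predicted genes."""
    counts = Counter(classify_rbs_motif(m) for m in motifs)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


profile = motif_profile(["TATA", "TAT", None, "AGGAG"])
```

Because the checks run in a fixed order, a motif such as TAAAT is assigned to ‘SD_Bacteroidetes’ before the later ‘DoubleA’ test is reached.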

Representative genomes were then grouped on the basis of the 
frequency of each motif type through hierarchical clustering (R func- 
tion ‘hclust’). This enabled the delineation of 12 genome groups on 
the basis of taxonomy (at the kingdom or domain ranks) and motif 
profile (Extended Data Fig. 2). Two types of random-forest classifiers 
were then built on the basis of the 14 features (11 motifs, gene density, 
coding density and average spacer length, see above): one for which 
the category to be predicted was binary (that is, ‘Virus NCLDV’ versus 
‘Other’) and one for which the category to be predicted was the set of 
genome groups based on predicted RBS motifs (‘NCLDV (non-pandoraviruses)’, 
‘animal and plants’, ‘protists & fungi’, ‘canonical bacteria 
and archaea’, ‘bacteroidetes-like’, ‘bacteria (CPR)’, ‘atypical bacteria’, 
‘atypical archaea’, ‘plasmids’ and ‘other viruses’, which include pan- 
doraviruses). The 14 features were evaluated on the whole genomes, 
as well as on fragments of 20 kb and 10 kb selected randomly along 
the genomes. These random fragments were used to train a classifier 
on input sequences more comparable to metagenome assemblies, 
which most often represent short genome fragments of a few kb. 
For these fragments, Prodigal was run with the ‘-p meta’ option and 
default parameters otherwise”, that is, without a full motif scan, as 
these sequences are typically too short to identify de novo RBS motifs. 
Animal and plant genomes were not included in this analysis as these 
are highly unlikely to be assembled from metagenomes. All classifiers 
were built using the R library randomForest and included 2,000 trees, 
with default parameters otherwise, and 10-fold cross-validation was 
performed to evaluate the classifier accuracy. The probability ‘prob’ 
of NCLDV origin was used as a prediction score to evaluate the classi- 
fiers and was then applied to metagenome assemblies. Because the 
input dataset is easily skewed towards bacterial and archaeal genomes, 
specificity and sensitivity were evaluated separately for each group 
of genome (Extended Data Fig. 2c). Statistical tests were performed 
in R using the package stats (Kolmogorov-Smirnov test)! and effsize 
(Cohen’s effect size)*. 


MAGs from non-targeted binning of IMG genomes 
Complementary to the targeted binning of NCLDV contigs, we per- 
formed genome binning of public metagenomes in IMG/M (assessed 
June 2018)* with MetaBAT (v.0.32.4)°? in the ‘superspecific’ mode, 
using read coverage information, if available in IMG, and a minimum 
contig length of 5 kb. Resulting MAGs were then checked for quality 
using CheckM (v.1.0.7)*. Genome bins with completeness <50% were 
labelled as low quality according to the ‘minimum information for a 
MAG’ (MIMAG) standards”. 


Targeted binning of putative NCLDV metagenome contigs 

The 5,064 NCLDV-specific models were used for hmmsearch (v.3.1b2, 
http://hmmer.org/) on the initial set of around 537 million proteins 
encoded on about 45 million contigs with a length greater than 5 kb 
with an E-value cut-off of 1 x 10° (Extended Data Fig. 1). In addition to 
the screening of the metagenomic contigs with NCLDV-specific models, 
we also used an automatic classifier using gene density and RBS motifs 
(see above). On the basis of the output of the automatic classifier, a 
score was assigned to each contig: a score of 2 if Ratio_TATATA_36 > 0.3 
or Pred_simple_NCLDV score > 0.3 and the prediction result was 
‘Virus NCLDV’; a score of 1 if Ratio_TATATA_36 > 0.3 or Pred_simple_NCLDV 
score > 0.1 or the prediction result was ‘Virus NCLDV’; otherwise 
a score of 0. On the basis of the cross-validation of the classifier, 
these parameters were chosen to maximize sensitivity while retaining 
enough specificity. The resulting set of around 1.2 million contigs with 
an RBS score of at least 1 and/or at least 20% of encoded genes (1 out 
of 5) with hits to the NCLDV models were subject to metagenomic bin- 
ning as follows: for each metagenome, putative NCLDV contigs were 
extracted and binning performed with MetaBAT™ (v.2) and contig read 
coverage information was used as input in case it was available in IMG”. 
The targeted binning approach gave rise to around 72,000 putative 
NCLDV MAGs. 
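The contig scoring and pre-selection rules above can be sketched as below. This is an illustration only. Two readings are assumptions: the condition for a score of 2 is interpreted as "(TATATA ratio > 0.3 or prediction score > 0.3) and predicted as ‘Virus NCLDV’", and "and/or" in the selection step is treated as a logical OR. Function and parameter names are hypothetical.

```python
def rbs_score(ratio_tatata_36, pred_simple_ncldv, prediction):
    """Return the contig score (0, 1 or 2) from the classifier output."""
    is_ncldv = prediction == "Virus NCLDV"
    # Assumed grouping: (ratio OR probability) AND predicted class.
    if (ratio_tatata_36 > 0.3 or pred_simple_ncldv > 0.3) and is_ncldv:
        return 2
    if ratio_tatata_36 > 0.3 or pred_simple_ncldv > 0.1 or is_ncldv:
        return 1
    return 0


def is_binning_candidate(score, n_genes, n_genes_with_model_hits):
    """Select contigs with score >= 1 and/or >= 20% of genes hit by NCLDV models."""
    fraction = n_genes_with_model_hits / n_genes if n_genes else 0.0
    return score >= 1 or fraction >= 0.2


# Toy examples.
strong = rbs_score(0.45, 0.2, "Virus NCLDV")
weak = rbs_score(0.05, 0.15, "Other")
none = rbs_score(0.05, 0.05, "Other")
```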


Filtering of GVMAGs 

Contigs with a length of less than 5 kb were removed from GVMAGs. 
Filtering was performed on the basis of the copy number of 
NCVOGs"* (Supplementary Tables 2, 3). GVMAGs were removed 
when they encoded more than 20 copies of NCVOG0023, 4 copies 
of NCVOG0038, 12 copies of NCVOG0076, 7 copies of NCVOG0249 
or 4 copies of NCVOG0262. On the basis of the copy numbers of 
16 conserved NCVOGs (NCVOG0035, NCVOG0036, NCVOG0038, 
NCVOG0052, NCVOG0059, NCVOG0211, NCVOG0249, NCVOG0256, 
NCVOG0262, NCVOG1060, NCVOG1088, NCVOG1115, NCVOG1117, 
NCVOG1122, NCVOG1127 and NCVOG1192), which are usually present 
at low copy numbers across all published NCLDV genomes, a duplica- 
tion ratio was calculated as follows. The total number of copies of the 
16 NCVOGs in the respective GVMAG was divided by the total number 
of unique observations of the 16 NCVOGs. GVMAGs with a duplication 
ratio higher than three were excluded from the dataset. We then used 
Diamond BLASTp“ against the NCBI non-redundant (nr) database 
(August 2018) and assigned a taxonomic affiliation on the basis of 
best BLASTp hits against Archaea, Bacteria, Eukaryota, phages or 
other viruses (including NCLDVs) to proteins using an E-value cut- 
off of 1x 10°. Best hits of query proteins to proteins derived from 
MAGs from the Tara Mediterranean metagenome binning survey” 
were disregarded owing to the high number of misclassified genomes 
in this dataset. Proteins without a hit in the NCBI nr database were 
labelled as ‘Unknown’. We then applied filters to remove contigs 
from GVMAGs on the basis of the distribution of taxonomic affilia- 
tion of best blast hits (Supplementary Table 7). Finally, alignments 
were built with mafft* (v.7.294b) for NCVOG0023, NCVOG0038, 
NCVOG0076, NCVOG0249 and NCVOG0262. Positions with 90% or 
more gaps were removed from the alignments with trimal*® (v.1.4). 
Protein alignments were concatenated and a species tree constructed 
with IQ-tree”’ (LG + F + R8, v.1.6.10). The phylogenetic tree was then 
manually inspected and for each clade outliers were removed on the 
basis of the presence, absence and copy numbers of 20 conserved 
NCVOGs*", duplication factor (see above), coding density, GC content 
and genome size. In addition, GVMAGs that represented singletons 
on long branches were manually removed. The filtered dataset was 
then clustered together with all available NCLDV reference genomes 
(December 2018) using average nucleotide identities of greater than 
95% and an alignment fraction of at least 50% with FastANI® (v.1.1). For 
each 95% average nucleotide identity cluster the 6 NCVOGs" with the 
on-average longest amino acid sequences (NCVOG0022, NCVOG0023, 
NCVOG0038, NCVOG0059, NCVOG0256 and NCVOG1117) were subjected 
to a within-cluster all-versus-all BLASTp. GVMAGs that had any 
full-length 100% identity hits between any of these marker proteins to 
other cluster members were removed from the dataset as potential 
duplicates. Duplicate GVMAGs originating from the conventional 
binning approach were removed first and GVMAGs with the largest 
assembly size were retained. 
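The two copy-number filters at the start of this section (per-marker copy limits and the duplication ratio over 16 conserved NCVOGs) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes NCVOG identifiers written with zeros (for example NCVOG0023), copy counts supplied as a dictionary, and hypothetical function names. The later taxonomy, tree-based and de-replication steps are not covered.

```python
# Upper copy-number limits for five markers; exceeding any of them removes the GVMAG.
COPY_LIMITS = {
    "NCVOG0023": 20, "NCVOG0038": 4, "NCVOG0076": 12,
    "NCVOG0249": 7, "NCVOG0262": 4,
}

# The 16 conserved NCVOGs used for the duplication ratio.
CONSERVED_16 = (
    "NCVOG0035", "NCVOG0036", "NCVOG0038", "NCVOG0052",
    "NCVOG0059", "NCVOG0211", "NCVOG0249", "NCVOG0256",
    "NCVOG0262", "NCVOG1060", "NCVOG1088", "NCVOG1115",
    "NCVOG1117", "NCVOG1122", "NCVOG1127", "NCVOG1192",
)


def exceeds_copy_limits(copy_counts):
    """True if any of the five markers is present in more copies than allowed."""
    return any(copy_counts.get(m, 0) > limit for m, limit in COPY_LIMITS.items())


def duplication_ratio(copy_counts):
    """Total copies of the 16 NCVOGs divided by the number of distinct ones observed."""
    total = sum(copy_counts.get(m, 0) for m in CONSERVED_16)
    unique = sum(1 for m in CONSERVED_16 if copy_counts.get(m, 0) > 0)
    return total / unique if unique else 0.0


def passes_copy_number_filters(copy_counts, max_ratio=3.0):
    """Keep a GVMAG only if it passes both copy-number checks."""
    return (not exceeds_copy_limits(copy_counts)
            and duplication_ratio(copy_counts) <= max_ratio)


ratio = duplication_ratio({"NCVOG0038": 2, "NCVOG0249": 1})
```

A bin with no observed conserved NCVOGs yields a ratio of 0.0 here by convention; how such bins were treated in the study is not stated in the text.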


GVMAG quality on the basis of estimated completeness and 
contamination 

Estimation of the quality of MAGs is critical for their interpretation 
and use in downstream applications. Standards exist for bacterial and 
archaeal MAGs that have proposed a three-tier classification (high, 
medium or low quality) based on estimated genome completeness 
and contamination®. These completeness and contamination metrics 
are typically calculated on the basis of a set of universal single-copy 
marker genes. A set of conserved genes in the NCLDV are the NCVOGs”, 
of which a subset has been shown to be probably vertically inherited’® 
(NCVOG20, Supplementary Table 2). We calculated for each superclade 
the average number of NCVOG20 present either as a single copy or 
as multiple copies (Supplementary Table 3). We then compared the 
number of observed single- and multicopy NCVOG20 in every GVMAG 
to the mean number of observations in the respective superclade. 
Considering the high genome plasticity of NCLDVs”, we tolerated 
a deviation from the mean by a factor of 1.2, which was considered 
low contamination, and a factor of 2 was considered medium con- 
tamination (Extended Data Fig. 4 and Supplementary Table 4). Higher 
deviations from the superclade mean were potentially caused by a 
non-clonal composition of the GVMAG; these were, as a consequence, 
considered to be of high contamination. We also estimated completeness 
on the basis of the presence of the NCVOG20 compared with other 
members of the respective superclade. The presence of 90% or more 
of the NCVOG20 compared with the superclade mean resulted in a 
classification as high quality in terms of completeness. If at least 50% 
of NCVOG20 were present in a GVMAG then the respective GVMAG was 
classified as medium quality in terms of estimated completeness, or 
low if less than 50% of NCVOG20 were present (Extended Data Fig. 4 
and Supplementary Table 4). The final GVMAG quality was determined 
on the basis of a combination of contamination and completeness 
(Supplementary Table 8). Additional criteria to assign GVMAGs to the 
high-quality category were the presence of no more than 30 contigs, a 
minimum assembly size of 100 kb and the presence of at least one contig 
with a length greater than 30 kb. The corresponding criteria for the 
medium-quality category were the presence of no more than 50 contigs, a minimum 
assembly size of 100 kb and the presence of at least one contig with a 
length greater than 15 kb. 
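The tiering rules above can be sketched as three small functions. This is a hedged illustration: the deviation "by a factor" is read as the ratio of observed NCVOG20 counts to the superclade mean (upward deviation only), completeness is read as the fraction of the superclade mean that is present, and all names are hypothetical. How the two tiers are combined into the final quality is defined in Supplementary Table 8 and is not reproduced here.

```python
def contamination_tier(observed_ncvog20, superclade_mean):
    """Tier by how far the observed NCVOG20 count exceeds the superclade mean."""
    factor = observed_ncvog20 / superclade_mean
    if factor <= 1.2:
        return "low"
    if factor <= 2.0:
        return "medium"
    return "high"


def completeness_tier(present_ncvog20, superclade_mean):
    """Tier by the fraction of the superclade mean NCVOG20 that is present."""
    fraction = present_ncvog20 / superclade_mean
    if fraction >= 0.9:
        return "high"
    if fraction >= 0.5:
        return "medium"
    return "low"


def meets_assembly_criteria(tier, n_contigs, assembly_size_bp, longest_contig_bp):
    """Check the additional assembly criteria for the high or medium category."""
    limits = {"high": (30, 30_000), "medium": (50, 15_000)}
    max_contigs, min_longest = limits[tier]
    return (n_contigs <= max_contigs
            and assembly_size_bp >= 100_000
            and longest_contig_bp > min_longest)


# Toy example: 18 of a mean of 20 markers present, in 12 contigs.
example = (completeness_tier(18, 20),
           contamination_tier(22, 20),
           meets_assembly_criteria("high", 12, 350_000, 45_000))
```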


Annotation of GVMAGs 

Gene calling was performed with GeneMarkS using the virus model”. 
For functional annotation proteins were subject to BLASTp against 
previously established NCVOGs" and the NCBI nr database (May 2019) 
using Diamond (v.0.9.21) BLASTp” with an E-value cut-off of 1.0 x 10>. 
In addition, protein domains were identified by pfam_scan.pl (v.1.6) 
against Pfam-A® (v.29.0), and rRNAs and introns were identified with 
cmsearch using the Infernal package“ (v.1.1.1) against the Rfam data- 
base® (v.13.0). No rRNA genes were detected in the final set of GVMAGs. 
The eggNOG mapper® (v.1.0.3) was used to assign functional categories 
to NCLDV proteins. Protein families were inferred with PorthoMCL” 
(version of December 2018) with default settings. 


Survey of the NCLDV MCP 

We used hmmsearch (v.3.1b2, http://hmmer.org/) optimized for the 
supercomputer Cori* to identify all copies of MCP encoded in the 
final set of GVMAGs and NCLDV reference genomes. Proteins were 
extracted and multiple sequence alignments were created with mafft*® 
(v.7.294b) for 74 NCLDV lineages with at least 5 copies of MCP. For each 
lineage-specific MCP alignment, we inferred models with hmmbuild 
(v.3.1b2, http://hmmer.org/). Using these models, the modified version 
of hmmsearch (v.3.1b2, http://hmmer.org/)** was used to identify all 
MCPs in the entire set of metagenomes (IMG/M®, June 2018). MCPs with 
identical amino acid sequences were excluded as potential duplicates. 
A logistic-regression-based classifier (sklearn LogisticRegression, 
solver = ‘lbfgs’, multi_class = ‘ovr’) was trained for each NCLDV lineage, 
taking into account the score distribution of all lineage MCP hits 
against the entire set of lineage-specific MCP models. The accuracy of 
the classifier was 0.861. Unbinned metagenomic MCPs were assigned 
to NCLDV lineages if the classifier returned a probability greater than 
50% (sklearn predict_proba), or as ‘novel’ if the probability was 50% or 
below. We then normalized the environmental MCP counts on the basis 
of the observed average copy number of MCP in GVMAGs and reference 
genomes in the respective lineage. Distribution of NCLDV lineages on 
the basis of MCPs was projected on a world map with Python 3/basemap 
on the basis of coordinates provided in IMG metagenomes”. 
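The assignment and normalization steps that follow the classifier can be sketched without the classifier itself. The example below assumes that per-lineage probabilities (such as those returned by predict_proba) are already available as a dictionary, and that normalization means dividing the MCP count of a lineage by its mean MCP copy number. Both are assumptions, and the lineage names and function names are hypothetical.

```python
def assign_lineage(probabilities, threshold=0.5):
    """Assign an MCP to the most probable lineage, or 'novel' at or below the threshold."""
    lineage, probability = max(probabilities.items(), key=lambda item: item[1])
    return lineage if probability > threshold else "novel"


def normalize_mcp_counts(mcp_counts, mean_copy_number):
    """Divide environmental MCP counts by the mean MCP copy number of each lineage."""
    return {lineage: count / mean_copy_number[lineage]
            for lineage, count in mcp_counts.items()}


# Toy example with two made-up lineages.
assigned = assign_lineage({"lineage_1": 0.72, "lineage_2": 0.18})
unassigned = assign_lineage({"lineage_1": 0.50, "lineage_2": 0.40})
normalized = normalize_mcp_counts({"lineage_1": 6}, {"lineage_1": 2.0})
```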


NCLDV species tree 

To build a species tree of the extended NCLDV, viral genomes with at 
least three out of five core NCVOGs” were selected: DNA polymerase 
elongation subunit family B (NCVOG0038), D5-like helicase-primase 
(NCVOG0023), packaging ATPase (NCVOG0249), DNA or RNA helicases 
of superfamily II (NCVOG0076), and poxvirus late transcription 
factor VLTF3-like (NCVOG0262). The NCVOGs were identified with 
hmmsearch (version 3.1b2, http://hmmer.org/) using an E-value cut-off 
of 1x10, extracted and aligned using mafft** (v.7.294b). Columns with 
less than 10% sequence information were removed from the alignment 
with trimal®. The species tree was then calculated on the basis of the 
concatenated alignment of all five proteins with IQ-tree” (v.1.6.10) with 
ultrafast bootstrap® and LG + F + R8 as suggested by model test as the 
best-fit substitution model”. The percentage increase in phylogenetic 
diversity” was calculated on the basis of the difference of the sum of 
branch lengths of the phylogenetic species trees of the NCLDV includ- 
ing the GVMAGs compared with a NCLDV species tree calculated from 
published NCLDV reference genomes (n= 205, no dereplication based 
on the average nucleotide identity) with IQ-tree as described above. 
Phylogenetic trees were visualized with iTol” (v.5). Genus or subfam- 
ily level lineages were defined on the basis of their monophyly in the 
species tree and presence or absence pattern of conserved NCVOGs 
(Supplementary Table 4). If no viral isolates were present in the respec- 
tive monophyletic clade we designated it MGVL. Neighbouring lineages 
with isolates and MGVLs were further combined under the working 
term superclade. Branch lengths separating clades differ based on the 
density of sampled viruses. 
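The increase in phylogenetic diversity is described as the difference between the summed branch lengths of two species trees. A minimal sketch of that calculation on Newick strings is shown below. Expressing the difference as a percentage of the reference tree is an assumption, as are the function names; the regular expression handles plain branch lengths only and would be confused by quoted labels that contain colons.

```python
import re

# Branch lengths follow a colon in Newick; support values precede it and are ignored.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def total_branch_length(newick):
    """Sum all branch lengths found in a Newick string."""
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))


def percent_pd_increase(reference_newick, extended_newick):
    """Percentage gain in summed branch length relative to the reference tree."""
    reference = total_branch_length(reference_newick)
    extended = total_branch_length(extended_newick)
    return (extended - reference) / reference * 100.0


# Toy trees: adding taxon C and an internal branch doubles the total length.
increase = percent_pd_increase("(A:1,B:1);", "((A:1,B:1)95:0.5,C:1.5);")
```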


Protein trees 

Target proteins were extracted from NCLDV genomes and used to query 
the NCBI nr database (June 2018) with Diamond BLASTp™. The top-50 
hits per query were extracted, merged with queries, dereplicated on 
the basis of protein accession number and aligned with MAFFT (-linsi, 
v.7.294b)*®, trimmed with trimal* (removal of positions with more than 
90% of gaps) and maximum-likelihood phylogenetic trees inferred with 
IQ-tree’’ (multicore v.1.6.10) using ultrafast bootstrap® and the model 
suggested by the model test feature implemented in IQ-tree® based 
on Bayesian information criterion. Selected models are indicated in 
the legend of Extended Data Fig. 8. Owing to its size, the phylogenetic 
tree for ABC transporter was inferred with FastTree” (v.2.1.10) LG and 
can be accessed at https://bitbucket.org/berkeleylab/mtg-gv-exp/. 
Phylogenetic trees were visualized with iTol” (v.5). Information on 
functional genes including parent contigs is provided in Supplemen- 
tary Table 5. 


Virus-host linkage through HGT 

To generate a cellular nr database, all non-cellular sequences and 
sequences from the Tara Mediterranean genome study” were removed 
from the NCBI nr database. All proteins in the NCLDV genomes were 
then subjected to Diamond BLASTp” against the cellular nr database 
using an E-value cut-off of 1x 10, an alignment fraction of 50% and 
a minimum sequence identity of 50%. Best blast hits within the same 
lineage were removed. Proteins that had a hit in cellular nr with a lower 
E-value compared with hits in the NCLDV blast database were considered 
HGT candidates. The total number of best hits from lineage pan-proteomes 
against defined groups of Eukaryotes were then used as edge 
weights to build an HGT network. The network was created in Gephi 
(v.0.92)”? using a force layout and filtered at an edge weight of 2. Pfam 
annotations of HGT candidates were based on the most commonly 
detected domains and functional categories were assigned with the 
eggNOG Mapper (v.1.03)®. Information on HGT candidates including 
parent contigs is provided in Supplementary Table 6. The number of 
HGT linkages was limited by the availability of reference genomes and 
the stringency applied. 
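The candidate selection and edge weighting described in this section can be sketched as follows. This is a simplified illustration, not the authors' code: hits are represented as plain dictionaries, a protein without any NCLDV hit is treated as a candidate, and "filtered at an edge weight of 2" is read as keeping edges with a weight of at least 2. All three points are assumptions, as are the names and the toy host groups.

```python
from collections import Counter


def is_hgt_candidate(cellular_hit, ncldv_evalue,
                     min_identity=50.0, min_alignment_fraction=0.5):
    """Decide whether a best cellular hit marks a protein as an HGT candidate.

    cellular_hit: dict with 'evalue', 'identity' (percent) and
    'alignment_fraction' (0-1) for the best hit in the cellular nr database.
    ncldv_evalue: E-value of the best NCLDV hit, or None if there is none.
    """
    if cellular_hit["identity"] < min_identity:
        return False
    if cellular_hit["alignment_fraction"] < min_alignment_fraction:
        return False
    if ncldv_evalue is None:
        return True  # assumption: no NCLDV hit counts as a candidate
    return cellular_hit["evalue"] < ncldv_evalue


def build_hgt_edges(candidate_links, min_weight=2):
    """Count (viral lineage, host group) links and keep edges at or above min_weight."""
    weights = Counter(candidate_links)
    return {edge: weight for edge, weight in weights.items() if weight >= min_weight}


# Toy example: two links to the same host group form one retained edge.
links = [("lineage_1", "Amoebozoa"), ("lineage_1", "Amoebozoa"),
         ("lineage_1", "Fungi")]
edges = build_hgt_edges(links)
candidate = is_hgt_candidate(
    {"evalue": 1e-40, "identity": 62.0, "alignment_fraction": 0.8}, 1e-20)
```

The resulting dictionary maps each (lineage, host group) pair to its weight and could be exported as an edge list for a network tool such as Gephi.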


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


All GVMAGs of estimated high and medium quality with an N50 of 
greater than 50 kb and estimated low contamination have been deposited 
at NCBI GenBank as MN738741-MN741037 under BioProject ID 
PRJNA588800. Nucleotide and protein sequences of GVMAGs can be 
directly downloaded from https://genome.jgi.doe.gov/portal/GVMAGs 
and https://figshare.com/s/14788165283d65466732, and will be available 
in the Integrated Microbial Genome/Virus (IMG/VR) system” at 
time of the v.3.0 release. All of the sequence data and metadata from the 
samples used in this study can further be accessed through the IMG/M 
system’ (https://img.jgi.doe.gov) and NCBI SRA using the metagenome 
identifiers provided in Supplementary Table 1. Sequence alignments, 
phylogenetic trees and other data underlying this study can be down- 
loaded from https://genome.jgi.doe.gov/portal/GVMAGs. 


Code availability 


The NCLDV classifier can be obtained from https://bitbucket.org/ 
berkeleylab/mtg-gv-exp/. 
[Figure: schematic of the discovery pipeline. Recoverable labels: binning of public IMG metagenomes (180,000 MAGs); 72,000 putative GVMAGs; filtering by random-forest gene density/RBS classifier, taxonomic distribution of best blastp versus nr hits, and tree-based outlier detection (singleton GVMAGs, ancestral NCVOGs); de-replication (ANI 95%); 2,074 strain-level and 1,941 species-level GVMAGs; bar chart of high-, medium- and low-quality GVMAGs (x axis, number of GVMAGs, 0 to 1,200).]

Extended Data Fig. 1| Discovery pipeline for GVMAGs. Approximately 
46 million contigs that were longer than 5 kb and were available in IMG/M» 
(June 2018) were screened for potential NCLDV contigs using a combination of 
5,064 NCLDV-specific HMMs and a random-forest classifier based on gene 
density and RBS motifs. The resulting set of 1.2 million contigs was then 
subjected to metagenomic binning using MetaBAT2”, with binning performed 
separately for each metagenome that contained putative NCLDV contigs. To 
the resulting approximately 72,000 GVMAGs, we added around 180,000 low-quality 
MAGs based on MIMAG* that were generated by non-targeted binning 
of metagenomes in IMG/M. The resulting set of approximately 252,000 
GVMAGs and MAGs were then filtered on the basis of assembly size and using a 
combination of the consensus of taxonomic affiliation of best blast hits across 
contigs, the presence or absence and copy numbers of frequently conserved 
NCLDV genes taking into account neighbouring taxa in the species tree and 
random-forest classifier based on gene density and RBS motifs. Outlier contigs 
were removed as described in the Methods and only MAGs that showed a copy- 
number distribution of frequently conserved NCLDV genes similar to closely 
related viral genomes were maintained in the final dataset. 
Extended Data Fig. 2| The RBS classifier. Unique features of NCLDV genomes 
and efficiency of random-forest classifiers based on these features. a, Gene 
density (y axis, average number of genes predicted per 10 kb of genome) for 
genomic sequences from different types of organisms or entities (x axis). 
Genomes were grouped on the basis of taxonomy (kingdom and domain ranks) 
as well as patterns of RBS motifs and gene density. ‘Other euk. viruses’, non-NCLDV 
eukaryotic viruses; ‘NCLDV Pandor.’, pandoravirus and similar NCLDVs; 
‘NCLDV (Other)’, non-pandoravirus NCLDVs. Centre lines of box plots 
represent the median, bounds of the boxes indicate the lower and upper 
quartiles, whiskers extend to points that lie within 1.5x the interquartile range 
of the lower and upper quartiles. Sample sizes (number of genomes) are 
indicated. b, Frequency of RBS motifs identified across different genome 
groups. RBS motif frequencies were based on prodigal gene prediction using 
the ‘full motif scan’ option. For clarity, only RBS motif frequencies >1% are 
displayed. RBS motif frequencies >30% are highlighted with a bold outline. 
‘Other Euk. viruses’, non-NCLDV eukaryotic viruses; ‘NCLDV (pandoravirus)’, 
pandoravirus and similar NCLDVs; ‘NCLDV (Other)’, non-pandoravirus NCLDVs. 
c, Predictions of NCLDV origin on the basis of genome features and predicted 
RBS motifs by random-forest classifiers for complete genomes (top) and short 
genome fragments (bottom). Predictions for individual genomes were 
obtained through a tenfold cross-validation. Similar results were obtained 
when predicting only two classes (NCLDV and non-NCLDV, displayed here) or 
when predicting classes corresponding to the eight types of genomes. CPR, 
candidate phyla radiation; SD, Shine-Dalgarno sequence. 
Extended Data Fig. 3 | Features of GVMAGs. a, Mean assembly size, GC content 
and coding density for each lineage in the NCLDV, coloured by superclade; 
individual data points are shown. Data are mean ± s.d. b, Assembly metrics of all 
GVMAGs compared to previously published NCLDV genomes included in this 
study. Centre lines of box plots represent the median, bounds of boxes indicate 
the lower and upper quartiles, whiskers extend to points that lie within 1.5× 
the interquartile range of the lower and upper quartiles. Sample size for the 
published data is 205 genomes and for GVMAGs is 2,074 genomes. 
LGR5 marks resident adult epithelial stem cells at the gland base in the mouse pyloric 
stomach’, but the identity of the equivalent human stem cell population remains 
unknown owing to a lack of surface markers that facilitate its prospective isolation 
and validation. In mouse models of intestinal cancer, LGR5+ intestinal stem cells are 
major sources of cancer following hyperactivation of the WNT pathway’. However, 
the contribution of pyloric LGR5+ stem cells to gastric cancer following dysregulation 
of the WNT pathway (a frequent event in gastric cancer in humans) is unknown. Here 
we use comparative profiling of LGR5+ stem cell populations along the mouse 
gastrointestinal tract to identify, and then functionally validate, the membrane 
protein AQP5 as a marker that enriches for mouse and human adult pyloric stem cells. 
We show that stem cells within the AQP5+ compartment are a source of WNT-driven, 
invasive gastric cancer in vivo, using newly generated Aqp5-creERT2 mouse models. 
Additionally, tumour-resident AQP5+ cells can selectively initiate organoid growth 
in vitro, which indicates that this population contains potential cancer stem cells. In 
humans, AQP5 is frequently expressed in primary intestinal and diffuse subtypes of 
gastric cancer (and in metastases of these subtypes), and often displays altered 
cellular localization compared with healthy tissue. These newly identified markers 
and mouse models will be an invaluable resource for deciphering the early formation 
of gastric cancer, and for isolating and characterizing human-stomach stem cells as a 
prerequisite for harnessing the regenerative-medicine potential of these cells in the 
clinic. 


To identify markers specific to LGR5^high pyloric stem cells within the 
gastrointestinal tract, we profiled the transcriptomes of quantitative 
(q)PCR-validated LGR5-enhanced green fluorescent protein (eGFP)^high 
(LGR5+ stem cells), LGR5-eGFP^low (immediate progeny) and unfractionated 
populations from the small intestine, colon and gastric pylorus 
of Lgr5-eGFP-IRES-creERT2 mice‘ by microarray, and identified genes 
that are selectively enriched in LGR5-eGFP^high pyloric stem cells (Fig. 1a, 
Extended Data Fig. 1a-d, Supplementary Table 1). This dataset also 
revealed the transcriptional signature of LGR5-eGFP^high colon stem cells 
(Supplementary Table 2). The profiling of LGR5-eGFP^high, LGR5-eGFP^low 
and LGR5-eGFP^− pylorus populations from another LGR5 reporter 
model (LGR5-DTR-eGFP, ref.>) revealed 67 overlapping genes (Fig. 1b). 


Candidate markers were empirically validated by qPCR, in situ hybridiza- 
tion (ISH) and immunostaining. Optimal candidates presented enriched 
expression in the LGR5-eGFP^high pyloric population by qPCR, robust localized 
mRNA and/or protein expression within the LGR5+ pyloric gland base’, 
minimal expression in intestines and gastric corpus and co-expression 
with Lgr5 (Extended Data Fig. 1j-n). Six candidates were selected: 
α-1,4-N-acetylglucosaminyltransferase (A4GNT), aquaporin 5 (AQP5), gastric 
intrinsic factor (GIF), mucin 6 (MUC6), solute carrier protein 9a3 (SLC9A3 
(also known as NHE3)) and secreted phosphoprotein 1 (SPP1, also known as 
osteopontin) (Fig. 1c-m, Extended Data Fig. 1e-n, Supplementary Table 3). 

To validate the six candidates as markers of populations enriched 
in pyloric stem cells, we generated eGFP-IRES-creERT2 mouse models 
Fig. 1 | Comparative profiling of LGR5 populations in gastrointestinal 
tissues identifies new pyloric-specific markers. a, Heat map of 
transcriptomes of LGR5-eGFP^high, LGR5-eGFP^low and unsorted populations 
from mouse pylorus, small intestine (SI) and colon (n=4, 2 and 3 biological 
replicates, respectively) from Lgr5-eGFP-IRES-creERT2 mice. Candidates are 
enriched only in the GFP^high pylorus population (black box). Values are 
log2-transformed. b, Overlap of candidate markers from Lgr5-eGFP-IRES-creERT2 
(blue) and Lgr5-DTR-eGFP (yellow, n=4 biological replicates) models (statistical 
significance assessed by hypergeometric distribution). c, Shortlisted 
pylorus-specific markers. Fold change is log2-transformed. GI, gastrointestinal. 
d, Relative Aqp5 expression (by qPCR) in gastrointestinal populations 
(n=2 technical replicates of pooled sample from 8 mice, means represented). 
e-j, ISH expression of candidate markers in pylorus (n=3 mice). k-m, AQP5 
expression with LGR5 in pylorus by immunostaining (k, m) and ISH (l) 
(n=3 mice). Scale bars, 25 μm. 
for Aqp5, A4gnt, Spp1 and Slc9a3 (Extended Data Fig. 2a), as well as
Aqp5-2A-eGFP, Slc9a3-2A-eGFP, Aqp5-2A-creERT2 and Slc9a3-2A-creERT2
mice, in which endogenous gene expression is unaffected (Extended
Data Fig. 2b). eGFP expression in the pylori of the 2A-eGFP models
recapitulated endogenous gene expression in the pylorus and small
intestine (Extended Data Fig. 2c-f). Additionally, the 97.4% concurrence
between the AQP5+ and SLC9A3+ pyloric populations revealed
by costaining (Extended Data Fig. 2g, h) reaffirms that these markers
effectively label the same population.

We evaluated the in vivo contribution of the gland populations that
express the candidate genes to epithelial renewal via lineage tracing
in creERT2;Rosa26-tdTomato^LSL lines. tdTomato expression was first
observed at the gland bases in all lines 20-48 h after tamoxifen induction,
confirming the expected Cre expression domain (Extended Data
Fig. 3a-d, q-r). After several months (tissue turnover spans 7-10 days),
multiple glands that were entirely tdTomato+ were evident throughout
the pylorus, documenting the long-term self-renewal and multipotency
of the cells expressing Aqp5, Slc9a3, Spp1 or A4gnt (Extended Data Fig. 3e-h,
s-t). Notably, no intestinal tracing was observed, except for transient
reporter expression within the villi of Slc9a3-cre models (Extended Data
Figs. 2f, 3i-x). These observations confirm that populations that express
Aqp5, Spp1, Slc9a3 and A4gnt contain pyloric stem cells.


AQP5 enriches for stem cells in mice


The AQP5 water channel protein® emerged as a promising candidate 
for isolating pyloric stem cells from mice and humans. eGFP expression 
was restricted to pyloric gland bases in Aqp5-eGFP-IRES-creERT2 and 
Aqp5-2A-eGFP mouse models (Extended Data Fig. 2c, w). Sorted eGFP+
cells from adult Aqp5-eGFP-IRES-creERT2 mice (Fig. 2a) showed a 9-fold
and 15-fold enrichment of Aqp5 and Lgr5 transcripts, respectively, over
eGFP− cells (Extended Data Fig. 2u, v). By immunostaining, endogenous
AQP5 protein colocalized with eGFP (Extended Data Fig. 2w). Thus, the
AQP5 models faithfully report Aqp5 expression in pyloric gland bases.

Next, we found that the AQP5+ population overlapped with GIF and
KI67, but not gastrin (GAST), chromogranin A (CHGA) or mucin 5,
subtypes A and C (MUC5AC), whereas the LGR5-eGFP^high population
expressed major gastric lineage markers (GIF, GAST and CHGA) and KI67
(Fig. 2b-e, Extended Data Fig. 2k-m, x). This observation was confirmed
via single-cell analysis of LGR5-eGFP^high pyloric stem cells, using CEL-seq
and RaceID. The LGR5-eGFP^high compartment comprised one major
and two minor subpopulations: the major subpopulation co-expressed
some of the newly identified pyloric markers (AQP5, GIF and MUC6)
(Extended Data Fig. 2n-q), one minor subpopulation expressed CHGA
and GAST, and the other expressed KRT8 and KRT18 (Extended Data
Fig. 2r, s). Proliferation-marker expression was significantly enriched
in the major subpopulation (Extended Data Fig. 2t). AQP5 staining on
Lgr5-DTR-eGFP pylori revealed a 94.1% overlap between the two populations
(Extended Data Fig. 2i, j), underscoring the CEL-seq finding that
AQP5 marks the major subpopulation of LGR5^high cells.

Lineage tracing using adult Aqp5-eGFP-IRES-creERT2;Rosa26-tdTomato^LSL
(Aqp5-IRES-creERT2;tdTomato) mice detailed the homeostatic
behaviour of AQP5+ cells: tdTomato+ cells appeared exclusively at the
gland bases 20 h after tamoxifen induction (Fig. 2f, Extended Data
Fig. 3y), expanded to clones reaching gland surfaces by 5 days (Fig. 2g)
and persisted for 1 year, demonstrating self-renewal of AQP5+ cells
(Fig. 2h, i, Extended Data Fig. 3z, c′). Uninduced controls presented
negligible tdTomato+ clones (Extended Data Fig. 3a′, b′). Six months
after induction, tdTomato+ clones comprised the major pyloric lineages
that expressed GIF, GAST, CHGA and MUC5AC (Fig. 2j-m), confirming
multipotency in the AQP5+ population. Ablating endogenous AQP5+
cells using an Aqp5-2A-DTR model severely disrupted the gland bases
(Extended Data Fig. 4k-r).

Tracing was absent from corpus, small intestine and colon, except 
for Brunner’s glands (Extended Data Fig. 3d’-h’). Our Aqp5-2A-eGFP 
Fig. 2 | Pyloric AQP5+ population contains stem cells in vivo. a, Aqp5-eGFP-
IRES-creERT2 transgene. b-e, eGFP co-immunostaining in Aqp5-eGFP-IRES-
creERT2 pylorus with GIF (b), GAST (c), MUC5AC (d) and CHGA (e) (n = 3 mice).
f-i, Lineage tracing of Aqp5-eGFP-IRES-creERT2;tdTomato^LSL pylori at 20 h (f),
5 days (g), 1 month (h) and 6 months (i) (n = 3 mice). j-m, Co-immunostaining of
tdTomato (tdTom) with CHGA (j), GAST (k), GIF (l) and MUC5AC (m) (n = 3 mice
per marker). n, FACS gating for sorting AQP5+ and AQP5− cells. Mean ± s.e.m.,
n = 6 biological replicates. SSC, side scatter. o, p, Relative Aqp5 (o) and Lgr5 (p)
expression (by qPCR) in sorted populations. Mean ± s.e.m.; n = 10 biological
replicates. q, Lgr5 and candidate-marker enrichment in volcano plot of
differentially expressed genes in AQP5+ population by microarray.


and Aqp5-creERT2 models also faithfully reported endogenous AQP5 
expression in tissues such as cornea, lung, mammary gland and salivary 
gland®’ (data not shown). 

To evaluate the utility of AQP5 as a marker for isolating enriched 
pyloric stem cell populations, we sorted AQP5+ and AQP5− cells from


n = 4 biological replicates, P value from one-way analysis of variance (ANOVA) in
Partek analysis software. N/C, no change. r-s, Outgrowth efficiency of single
antibody-mediated FACS-sorted AQP5+ and AQP5− cells (r) with representative
image (s). n = 5 biological replicates. t, Longevity and highest passage number
of organoids from AQP5+ (green circles) and AQP5− (grey diamonds) cells.
n = 3 biological replicates. u, Organoid differentiation protocol. v-x, Relative
Aqp5 (v), Lgr5 (w) and Muc5ac (x) expression (by qPCR) in AQP5+ cell-derived
organoids on day 4 of differentiation (diff). n = 5 biological replicates. Scale
bars, 25 µm (b-m), 100 µm (s). Graphs represent mean ± s.e.m. with two-sided
t-test (o, p, r, t, v-x).


adult wild-type mice using AQP5 antibody by fluorescence-activated
cell sorting (FACS) (Fig. 2n, Extended Data Fig. 4a, b) to profile their
transcriptomes and evaluate their in vitro capacity to form organoids.
Aqp5 and Lgr5 were markedly enriched in the AQP5+ population relative
to the AQP5− population, by qPCR and microarray (Fig. 2o, p,
Fig. 3 | Human pyloric AQP5+ cells are stem cells ex vivo. a, Immunostaining of
AQP5 in normal human pylorus. n = 3 biological replicates. b-d, AQP5 and LGR5
expression in pylorus (b), near gland surface (c) and base (d) by ISH.
n = 3 biological replicates. c and d are higher-magnification images of the areas
in the top and bottom black boxes, respectively, in b. e, f, AQP5 and KI67
colocalization (arrowheads). n = 3 biological replicates. f shows a higher-
magnification image of the area in the white box in e. g, FACS gating for sorting
human AQP5+ and AQP5− cells. Mean ± s.e.m., n = 4 biological replicates.
h, Relative AQP5 expression (by qPCR) in FACS-sorted AQP5+ and AQP5− cells.
n = 3 biological replicates. i, j, Outgrowth efficiency of single AQP5+ and AQP5−
cells in vitro (i) with a representative image (j). n = 3 biological replicates.
k, l, Relative LGR5 (k) and MUC5AC (l) expression (by qPCR) in AQP5+ cell-
derived organoids on day 4 of differentiation. n = 3 biological replicates.
m, Heat map of 200 genes with highest variation between normal human
pylorus AQP5+ and AQP5− populations by RNA-seq. n = 8 biological replicates.
n, Gene Ontology (GO) terms associated with genes significantly enriched in
AQP5+ population. ORA, overrepresentation analysis. Scale bars, 100 µm
(a, b, e, j), 50 µm (c, d, f). Graphs are presented as mean ± s.e.m., two-sided t-test.


Extended Data Fig. 4d, e). AQP5+ and LGR5-eGFP^high transcriptomes
are highly correlated by gene set enrichment analysis (false discovery
rate P value < 0.001) (Extended Data Fig. 4c), with the AQP5+ population
presenting strong enrichment of our newly identified markers
(Fig. 2q, Extended Data Fig. 4f) and the previously published gland-base
markers Lrig1 and Runx1 (Extended Data Fig. 4g). Axin2, Cck2r (also
known as Cck2b) and Sox2 were not enriched in the AQP5+ population,
consistent with their relatively broad expression within pyloric
glands (Extended Data Fig. 4g). Concurring with the immunostaining
data, AQP5+ cells presented high Gif (also known as Cblif), moderate
Ki67 (also known as Mki67) and no Muc5ac expression; there was no
significant difference in Gast and Chga expression between AQP5+
and AQP5− populations, which probably reflects the limited numbers
of GAST+ G cells and CHGA+ endocrine cells within the AQP5+ population
(Extended Data Fig. 4h). Therefore, this antibody-based strategy
facilitates the enrichment of mouse pyloric stem cells, independently
of fluorescent reporters.

Compared to AQP5− cells, AQP5+ cells generated threefold-more
organoids (0.64% versus 2.58%, respectively) that could be maintained
in the long term, whereas the few organoids derived from AQP5− cells
died within three weeks (Fig. 2r-t). Organoid initiation frequencies
from AQP5+ cells (2.58%) and LGR5-eGFP+ cells (3.09%) from Lgr5-2A-
eGFP mice were similar (Extended Data Fig. 4s-u), indicating high functional
overlap. AQP5+ cell-derived organoids showed heterogeneous
AQP5 expression, which partially overlapped with KI67 (Extended Data
Fig. 4i, j), similar to the in vivo pattern of AQP5 and KI67. Withdrawal
of WNT3A, FGF10 and NOGGIN (Fig. 2u) resulted in the downregulation
of the stem cell markers Lgr5 and Aqp5 and the upregulation of
the differentiation marker Muc5ac in the organoids after three days
(Fig. 2v-x). Therefore, AQP5 is a useful marker for the prospective isolation
of enriched mouse pyloric stem cells.


AQP5 enriches for stem cells in humans 


We sought to evaluate AQP5 as a marker that facilitates the enrichment
of human pyloric stem cells. AQP5, along with MUC6, A4GNT
and SLC9A3 (homologous to the mouse pyloric markers), was exclusively
expressed at human pyloric gland bases, overlapping with LGR5
(Fig. 3a-d, Extended Data Fig. 5a-f, a′-f′). A minor proportion of the
human AQP5+ cells were KI67+ (Fig. 3e, f), reminiscent of the mouse
pylorus (Extended Data Fig. 2x). Human AQP5+ cells overlapped with
PEPC+ and MUC6+ populations, but not GIF+ (also known as CBLIF) parietal
cells or MUC5AC+ foveolar cells (Extended Data Fig. 5g-j).

We next sorted AQP5+ cells from healthy human pyloric specimens
using AQP5 antibody and verified a 10.2-fold enrichment of AQP5
expression in AQP5+ cells by qPCR (Fig. 3g, h). AQP5+ cells routinely
established organoids that were passaged for more than three months,
whereas AQP5− cells never initiated organoids (Fig. 3i, j). The human
pyloric organoids expressed AQP5 heterogeneously, partially overlapping
with KI67 (Extended Data Fig. 5k, l). Withdrawal of WNT3A,
NOGGIN and FGF10 resulted in reduced LGR5 and AXIN2 expression and
increased MUC5AC and TFF2 expression, indicating differentiation towards
mucous lineages (Fig. 3k, l, Extended Data Fig. 5m-o).

We then performed RNA sequencing (RNA-seq) on AQP5+ pyloric cells
isolated by FACS from healthy human pylori, and validated the top hits
using qPCR. We identified more than 500 differentially expressed genes
that were significantly up- or downregulated by more than fourfold in
AQP5+ versus AQP5− populations (Fig. 3m, top candidates are listed in
Supplementary Table 3). Enrichment of 18 candidates in the AQP5+ population
was validated by qPCR (Extended Data Fig. 6a-e). Genes homologous to
the newly identified mouse pyloric markers (including AQP5, A4GNT and
MUC6) and an intestinal stem cell marker, SMOC2, were upregulated in
the AQP5+ fraction by RNA-seq and qPCR (Extended Data Fig. 6a). ISH
confirmed SMOC2 as being expressed in a subset of gland-base cells (Extended
Data Fig. 6f, f′, f″). Gene Ontology analysis revealed that approximately
half of the candidates were membrane-expressed and had protein-binding
activity (Fig. 3n, Extended Data Fig. 6b), which suggests that additional
markers could be identified from our list to potentially further enrich
for pyloric stem cells. Moreover, components of key signalling pathways
(for example, WNT and Notch), chemokine signalling and extracellular
matrix, as well as some uncharacterized genes, were also enriched in
the AQP5+ population, highlighting the additional biological insight to
be gained from the profile (Fig. 3n, Extended Data Fig. 6c-g).
Fig. 4 | AQP5+ population is a source of WNT-driven, invasive gastric cancer.
a, Characteristics of models of gastric cancer. UTR, untranslated region.
b-h, Characterization of Aqp5-eGFP-IRES-creERT2 APK intramucosal gastric
(black box) in b for AQP5-GFP (d), E-cadherin (e), phospho-(p)AKT (f), MAPK (g)
and β-catenin (h). h is a higher-magnification image of Extended Data Fig. 9a. i-o,
Characterization of Slc9a3-2A-creERT2 AP gastric adenocarcinoma. H&E stain
of the entire pyloric region (i) and the focal invasion (black box in i) (j).
Immunostaining for AQP5 (k), E-cadherin (l), β-catenin (m), phospho-AKT


Collectively, these data demonstrate the utility of AQP5 as a marker 
for isolating enriched populations of endogenous stem cells from 
human stomach epithelia for downstream purposes. 


AQP5+ cells as source of gastric cancer

We targeted conditional oncogenic mutations to pyloric-stem-cell- 
enriched populations using our new creERT2 mouse models to evaluate 
the contribution of these populations to gastric cancer and circumvent 
the rapid lethality of LGR5+ cell-driven models of cancer. To determine
how frequently pathways are co-dysregulated in gastric cancer, we ana- 
lysed transcriptomes of patients with gastric cancer from The Cancer 
(n) and MAPK (o) in the focal invasion. p, Costaining of eGFP, KI67 and
E-cadherin in Aqp5-creERT2 APK pyloric tumour. q, Initiation frequencies of
eGFP+ cells (APK-eGFP+) versus eGFP− cells (APK-eGFP−) from Aqp5-
creERT2 APK pyloric tumour. Paired two-sided t-test, mean ± s.e.m. r, Mean and
individual longevities of organoids derived from APK-eGFP+, APK-eGFP− and
AQP5-eGFP+ cells from normal pylorus (norm-GFP+) cells seeded with
exogenous growth factors (GF) for a week and without GF for the next four
weeks. s, Representative images of organoids derived from APK-eGFP+ cells
and norm-eGFP+ cells in the respective growth factor conditions. Scale bars,
1 mm (b, i), 100 µm (d-h, j-o), 20 µm (p), 500 µm (s).


Genome Atlas (TCGA) (n = 155) and East Asian cohorts (n = 42) for WNT/β-catenin,
PI3K and KRAS signalling activities (Extended Data Fig. 7a, b),
the components of which are frequently mutated in gastric cancer.
Using published gene signatures, we found that WNT/β-catenin
signalling was commonly hyperactivated in human gastric cancer (more
than 80% in both cohorts of patients), and frequently co-occurred with
hyperactivation of PI3K and/or RAS signalling (57.1-64.3%) (Extended
Data Fig. 7). We thus recapitulated these co-dysregulated pathways by
crossing our pyloric creERT2 drivers to conditional Apc, Pten and Kras
alleles, and induced hyperactivation of the pathways at adulthood.
All mouse models developed sizeable tumours exclusively in the
pylorus, with latencies ranging from 1 to 11 months after induction (in
n = 30 out of 34 mice) (Extended Data Fig. 8a-g). In the tumours, which
were classified as tubular-type gastric adenocarcinomas (according to
the World Health Organization (WHO) classification), malignant structures
surrounded by stroma and inflammatory cells replaced normal
glands. Sole hyperactivation of Kras did not produce pyloric tumours
(Extended Data Fig. 8h, i). The pylori of uninduced mice of the same
cancer genotype that lacked Cre were normal (Extended Data Fig. 8j).

Across all combinations of oncogenes, the tumour incidence in
2A-creERT2 models was 100% compared to 82.6% in IRES-creERT2
models (Fig. 4a), reflecting differences in Cre activation efficacies.
The 2A-creERT2 models displayed almost contiguous tumour growth,
which contrasts with the multifocal lesions in IRES-creERT2 models
(Extended Data Fig. 8b-g). Hyperactivation of WNT/β-catenin signalling
alone (Apc (hereafter, A)) was sufficient to drive tumorigenesis,
and co-activation of the PI3K and/or KRAS pathways (Apc;Pten
(hereafter, AP) or Apc;Pten;Kras (hereafter, APK)) accelerated
tumour development and progression (Extended Data Fig. 8a-g). We
also observed focal invasions through the muscularis mucosae in A
and AP models (IRES-creERT2, 15.8% and 2A-creERT2, 66.7%) (Fig. 4a).
As expected, intestinal tumours were never observed (Extended Data
Fig. 8k, l).

We characterized all tumours to detail pathway activation, proliferation
status, lineage-marker expression and epithelia and stroma
constitution. As there are no major phenotypic differences between
the models, we present data from Aqp5-IRES-creERT2 APK tumours and
focal invasions from Slc9a3-2A-creERT2 AP tumours (Fig. 4b, c, i, j). The
pyloric tumours and focal invasions were predominantly tdTomato+,
confirming that they originated from AQP5+ or SLC9A3+ cells (Extended
Data Fig. 9b, j). In contrast to the normal pylorus (Extended Data
Fig. 9p-w), the gastric adenocarcinomas presented hyperactivation
of WNT/β-catenin, MAP kinase (MAPK) and phospho-AKT pathways, as
evidenced by increased levels of expression of nuclear and/or cytoplasmic
β-catenin, MAPK and phospho-AKT, respectively (Fig. 4f-h, m-o,
Extended Data Fig. 9a). These regions were also highly proliferative, and
lacked GIF, GAST and MUC5AC expression (Extended Data Fig. 9d-g,
l-o). In the Aqp5-IRES-creERT2 APK model, tdTomato+ cells that retained
E-cadherin expression were also found throughout the tumour stroma
(Fig. 4e). In the Slc9a3-2A-creERT2 AP model, E-cadherin+ cells infiltrated
through the muscularis mucosae (Fig. 4l). Immunostaining for eGFP
reporter expression revealed Aqp5 expression in a subpopulation of
the pyloric tumours, some of which were KI67+ (Fig. 4d, k, p). Many of
the AQP5+ cells in the tumours co-expressed Lgr5, with increased Aqp5
expression in tumours compared to adjacent normal mucosa (Extended
Data Fig. 9h, h′, h″). There was a low incidence of tumours within
non-gastrointestinal organs, such as the salivary gland (less than 25% in Aqp5-
IRES-creERT2-driven cancer models), which did not affect survival to
preclude the development of gastric adenocarcinoma (Extended Data
Fig. 9i, i′). Thus, our mouse models support pyloric-stem-cell-enriched
populations as being a source of invasive, WNT-driven gastric cancer
and are valuable for modelling gastric cancer in vivo.


AQP5+ tumour cells show ex vivo stemness


We then determined whether AQP5+ tumour cells behave differently
from their AQP5− counterparts. After confirming that the eGFP reporter
recapitulated endogenous AQP5 expression in the Aqp5-IRES-creERT2
APK tumours (Extended Data Fig. 9x, x′), we cultured sorted eGFP+
and eGFP− cells (Extended Data Fig. 9y, z, z′, z″). eGFP+ tumour cells
reproducibly generated organoids that could be serially propagated in
the absence of exogenous growth factors, whereas AQP5-GFP− tumour
populations never produced organoids, despite containing KI67+ cells
(Fig. 4p-s). Moreover, although AQP5-GFP+ normal cells could initiate
organoids with growth factors, they died upon the removal of growth
factors (Fig. 4r, s). These data suggest that the stem potential of the
tumour is found within the AQP5+ compartment.


AQP5 expression in human gastric cancer


We surveyed AQP5 expression in human gastric cancer by immunostaining
on a tissue microarray of 145 samples of distal gastric cancer comprising
intestinal, diffuse and mixed subtypes with variable grades of
differentiation. AQP5 was expressed in most intestinal, diffuse and
mixed cases (Extended Data Fig. 10a-e). Contrasting with normal
pylorus, 96.1% of the tumour samples displayed cytoplasmic AQP5
expression, and 37.9%, 3.9% and 37.9% had membranous, nuclear and
multiple localizations of AQP5, respectively (Extended Data Fig. 10a-e).
Although the intracellular localization of AQP5 has previously been
reported in other cancers, its functional relevance is unknown.

Full sections of 54 advanced human distal gastric adenocarcinomas
and 12 metastatic lesions showed that most expressed AQP5 (Extended
Data Fig. 10f). Cytoplasmic AQP5 was observed in all AQP5+ samples,
and membranous or luminal, nuclear and multiple sites of AQP5 localization
were found in 53.7%, 9.8% and 53.7% of the samples, respectively
(Extended Data Fig. 10f-j). In 51.2% of the sections, submucosal
malignant cells expressed more AQP5 than their mucosal counterparts
(Extended Data Fig. 10g-j, o, p). AQP5 was expressed in poorly cohesive
tumour cells in 70% of the cohort (Extended Data Fig. 10j, o). All AQP5+
cells in the submucosa retained E-cadherin expression, and a subset
co-expressed KI67 (Extended Data Fig. 10k-n). In cases of intestinal
metaplasia (which is strongly correlated with gastric cancer), 46.2%
displayed mild-to-moderate AQP5 expression (Extended Data Fig. 10o,
q). AQP5 was also weakly expressed in a subset of signet ring cells in
66.7% of the samples (Extended Data Fig. 10o, r). Of the metastatic
lesions with AQP5+ primary tumours, 83.3% contained AQP5+ tumour
cells in the lymph node (Extended Data Fig. 10o, s).

Our broad survey shows that AQP5 is commonly expressed in primary
intestinal and diffuse subtypes of gastric cancer, as well as in metastases
of these subtypes.

Efforts to exploit the therapeutic potential of stem cells require 
functionally validated markers for their prospective isolation. To our 
knowledge, for the first time we have purified enriched populations of 
human pyloric stem cells, and demonstrated the direct contribution 
of AQP5+ cells in mouse models of gastric carcinogenesis. Although
the expression of secretory markers such as GIF and MUC6 may not
conform to the classical ‘undifferentiated’ stem cell model, it is
well-established in the liver and lung that more-specialized cells can serve
as homeostatic stem cells, an emerging trend in epithelial organs.
AQP5 is known for regulating water transport in healthy tissues, and
has increasingly been implicated in cancers as a driver of proliferation
and invasiveness in vitro. Various human cancers (including gastric,
breast, soft tissue sarcoma, lung, oesophageal and colorectal cancers)
present high AQP5 expression. We show that AQP5 is expressed
in most human primary tumours, and metastases, of intestinal and diffuse
subtypes of gastric cancer. We found that tumour-resident AQP5+
cells in our mouse model of gastric cancer selectively exhibited ex vivo
stem potential, indicating that the AQP5+ tumour population contains
cancer stem cells. Future evaluation of the stem potential of AQP5+ cells in
human gastric cancers using our antibody-based isolation protocols 
has the potential to reveal opportunities for developing more-effective 
cancer therapeutic strategies. 


Methods


Mice 

For exon 1 knock-ins, an eGFP-IRES-creERT2 cassette was inserted
immediately downstream of the start codons of Aqp5, A4gnt, Spp1 and
Slc9a3 gene loci by homologous recombination in embryonic stem
cells, as illustrated in Extended Data Fig. 2a. For 3′ UTR knock-ins, a
2A-creERT2, 2A-eGFP or 2A-DTR cassette was inserted immediately
before the stop codon of Aqp5, Lgr5 and/or Slc9a3 gene loci by homologous
recombination in embryonic stem cells, as illustrated in Extended
Data Fig. 2b, thereby preserving the intact protein-coding region and
endogenous expression of the genes. Rosa26-tdTomato^LSL (Ai14) (JAX
ID 007914) mice were obtained from Jackson Laboratories. Lgr5-
eGFP-IRES-creERT2 (JAX ID 008875), Lgr5-DTR-eGFP (MGI ID
5294798), Kras (JAX ID 019104), Apc (MGI ID 1857966) and
Pten (MGI ID 2182005) mice have previously been described. All Cre
and eGFP lines were bred as heterozygotes except A4gnt-IRES-creERT2 
and Spp1-IRES-creERT2 mice, which were bred as homozygotes. All 
mouse experiments were approved by the Institutional Animal Care 
and Use Committee of A*STAR, and performed in compliance with all 
relevant ethical regulations. The maximum tumour size allowed by 
the IACUC is 20 mm in any dimension and none of the experiments 
exceeded this limit. For all experiments, adult mice (not selected for 
sex) with a minimum age of 7-8 weeks were used. The experiments 
were not randomized, and there was no blinded allocation during 
experiments and outcome assessment. No statistical methods were 
used to predetermine sample size. Genotyping primers are collated 
in Supplementary Table 5. Mouse lines are available from N.B. upon 
request. 


Human material 

Normal human pylorus for FACS was provided by K. G. Yeoh, J. So and
A. Shabbir, NUS Department of Medicine and Pathology (granted under 
IRB protocol 11-167E) and N. Inaki and T. Tsuji, Ishikawa Prefectural 
Central Hospital. Informed consent was obtained from all patients and 
experiments were performed in compliance with all relevant ethical 
regulations. Human distal cancer formalin-fixed paraffin-embedded 
(FFPE) sections were provided by NUS Department of Medicine and 
Pathology (granted under IRB protocol 11-167E) and Leeds Teaching
Hospitals NHS Trust (granted under IRB protocol CAO1 122). 


Mouse treatment 

Mice were each injected with tamoxifen dissolved in sunflower oil
intraperitoneally, at 4 mg tamoxifen per 30 g body weight. Diphtheria-
toxin-treated mice were injected with a single dose of diphtheria toxin
dissolved in PBS intraperitoneally, at 0.5 µg diphtheria toxin per 30 g
body weight.
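
Because both doses are specified per 30 g of body weight, the amount for an individual animal follows by simple proportion. The short sketch below illustrates that scaling; the function name and the example body weights are hypothetical, and only the tamoxifen figure of 4 mg per 30 g is taken from the text.

```python
def dose_for_weight(dose_per_30g, body_weight_g):
    """Scale a dose specified per 30 g body weight to one animal.

    The result is in the same unit as dose_per_30g.
    """
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    return dose_per_30g * body_weight_g / 30.0


# Tamoxifen dose stated in the text: 4 mg per 30 g body weight.
TAMOXIFEN_MG_PER_30G = 4.0

# Hypothetical body weights (g), for illustration only.
for weight in (20.0, 24.0, 30.0):
    dose = dose_for_weight(TAMOXIFEN_MG_PER_30G, weight)
    print(f"{weight:.0f} g mouse: {dose:.2f} mg tamoxifen")
```

The same function applies unchanged to the diphtheria toxin dose, which is also given per 30 g body weight.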


Gland isolation, cell dissociation and flow cytometry 

Mouse pylorus. Mouse pylorus was incubated in chelation buffer (5.6
mM sodium phosphate, 8 mM potassium phosphate, 96.2 mM sodium
chloride, 1.6 mM potassium chloride, 43.4 mM sucrose, 54.9 mM
D-sorbitol, 1 mM dithiothreitol) with 5 mM EDTA at 4 °C for 2 h. Glands
were isolated by repeated pipetting of finely chopped pylorus tissue
in cold chelation buffer. Chelation buffer containing isolated glands
was filtered through a 100-µm filter mesh, and centrifuged at 720g at
4 °C for 3 min. The pellet was resuspended in TrypLE (Life Technologies)
with DNase I (0.8 U/µl) (Sigma) and incubated at 37 °C for 10 min
with intermittent trituration for digestion into single cells. Digestion
was quenched by dilution with cold HBSS buffer. The suspension
was centrifuged at 720g at 4 °C for 3 min. For AQP5 antibody staining,
the pellet was resuspended in HBSS with 2% fetal bovine serum (FBS)
(Hyclone) with AQP5-AF647 (Abcam, ab215225) at 1:500 dilution and
incubated on ice for 30 min in the dark. The pellet was subsequently
washed twice with cold HBSS and spun at 800g for 3 min at 4 °C. The


pellet was resuspended in HBSS with 2% FBS. Before sorting, 1 µg/ml
propidium iodide (Life Technologies) was added to the cell suspensions,
which were filtered through a 40-µm strainer and sorted on a BD Influx Cell
Sorter (BD Biosciences). Cells were collected in RLT Plus buffer (Qiagen)
for RNA extraction or HBSS with 2% FBS and 1% PenStrep (Gibco) for
organoid culture.


Human pylorus. Human pylorus was collected in advanced DMEM/F-12 
medium with 10 mM HEPES, 2 mM Glutamax (incubation buffer, all 
from Life Technologies), supplemented with 1x Anti-Anti (Life Technologies)
and 1 mM N-acetylcysteine (Sigma). After at least 3 washes in
HBSS, the pylorus was finely chopped and digested in incubation buffer 
supplemented with 1 mg/ml collagenase (Gibco) and 2 mg/ml bovine 
serum albumin (Sigma) for 30 min at 37 °C with intermittent mixing. 
The remainder of the processing protocol is identical to that for the 
mouse tissue described in ‘Mouse pylorus’. Cells for organoid culture 
were collected in organoid culture medium with growth factors and 
0.2% growth-factor-reduced Matrigel (Corning) (v/v). 


Organoid culture 

Organoid culture of FACS-isolated single human and mouse pylorus
cells was performed as previously described. In brief, single cells
were resuspended in growth-factor-reduced Matrigel (Corning) and
cultured in basal medium (advanced DMEM/F-12 medium with 10 mM
HEPES, 2 mM Glutamax, 1x N2, 1x B27 (all Invitrogen), N-acetyl-cysteine
(Sigma) and primocin (Invivogen)) supplemented with the following
growth factors: EGF (Invitrogen), GAST (Sigma), FGF10 (Peprotech),
Noggin (Peprotech), WNT3A (Millipore), R-spondin and ROCK inhibitor
Y27632 (Sigma). A83-01 (Tocris) was also added to human pyloric
cultures. Mouse cancer organoids were grown in basal medium only
after the first week of culture. Organoids were passaged when confluent,
at least once a week. Only organoids beyond 100 µm and 200 µm in
diameter with a clear central lumen were scored as organoids for Fig. 2
and Fig. 3, respectively.


RNA isolation and qPCR 

Tissues were lysed in Trizol (Qiagen) and single cells were lysed in RLT
Plus buffer (Qiagen). RNA was subsequently isolated with RNeasy
Universal Plus kit (Qiagen) and cDNA was generated with Superscript
III (Life Technologies) according to the manufacturer’s instructions.
qPCR was performed with a minimum of three biological replicates per
gene using SYBR green dye (Promega) according to the manufacturer’s
instructions, and run on StepOne or QuantStudio 7 qPCR machines
(Applied Biosystems). Analysis was carried out using the double Ct
method on StepOne Software on the respective qPCR machines
(Applied Biosystems). qPCR validation of top candidates from RNA-seq
was performed on 2-4 ng of SPIA-amplified cDNA derived from the
Ovation Pico WTA system (Nugen Technologies) owing to limited
RNA availability. In the event that any of the samples for a specific
target did not amplify, the relative expression values of all the samples
for that target were increased by one to enable visualization of the
values on a log scale. Sequences of qPCR primers are collated in
Supplementary Table 5.
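
The relative-expression calculation and the log-scale adjustment described above can be illustrated with a short sketch. It assumes that the analysis corresponds to the widely used comparative (ΔΔCt) calculation with a single reference gene and 100% amplification efficiency; the function names and all Ct values are hypothetical and are not taken from the study.

```python
def relative_expression(ct_target, ct_reference,
                        ct_target_calibrator, ct_reference_calibrator):
    """Comparative Ct (delta-delta Ct) estimate of relative expression.

    Assumes perfect doubling per cycle, hence the base of 2.
    """
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


def shift_for_log_scale(values):
    """If any sample failed to amplify (value 0), add one to every value
    for that target so that all values can be drawn on a log axis."""
    if any(v == 0 for v in values):
        return [v + 1 for v in values]
    return list(values)


# Hypothetical Ct values: target gene in a sorted population versus a
# calibrator population, each normalized to a reference gene.
fold = relative_expression(ct_target=22.0, ct_reference=18.0,
                           ct_target_calibrator=25.0,
                           ct_reference_calibrator=18.0)
print(fold)

# Hypothetical relative values in which one sample did not amplify.
print(shift_for_log_scale([0.0, 3.5, 12.0]))
```

Shifting every sample of the affected target by the same constant preserves their ordering while avoiding an undefined logarithm for the non-amplifying sample.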


Transcriptome profiling and analysis 

Single-cell RNA-seq, CEL-seq and RaceID. Single LGR5-eGFP^high
pyloric epithelial cells from Lgr5-DTR-eGFP mice were isolated by FACS
(as described in ‘Mouse pylorus’ in ‘Gland isolation, cell dissociation and
flow cytometry’) and collected in each well of 96-well plates. Total RNA
extracted from each cell was used to generate single-cell RNA expression
libraries as previously described. A total of 285 LGR5-eGFP^high
cells from 3 mice were sequenced on an Illumina HiSeq 2500 instrument
using 101 base-pair paired-end sequencing. K-means clustering
in RaceID was used to delineate clusters of subpopulations, as previously
described.
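
For readers unfamiliar with the clustering step, the sketch below shows generic K-means (Lloyd's algorithm) on a toy two-marker expression table. It is not the RaceID implementation, which is an R package with its own preprocessing and choice of cluster number; the deterministic initialization, the function name and the toy values are all assumptions made for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain K-means: alternate nearest-centroid assignment and
    centroid update until the assignments stop changing."""
    # Assumption: the first k points seed the centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Toy data: (marker 1, marker 2) expression for four hypothetical cells.
cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(cells, k=2)
print(labels, centroids)
```

In practice the number of clusters and the distance measure strongly affect the result, which is why dedicated single-cell tools wrap this step in additional model selection.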


Microarray and analysis. Labelling, hybridization and washing proto- 
cols for microarrays were performed according to Origene instructions. 
RNA quality was first determined by assessing the integrity of the 28S 
and 18S ribosomal RNA bands on Agilent RNA 6000 Pico LabChips
in an Agilent 2100 Bioanalyzer (Agilent Technologies). A minimum 
of 2 ng of RNA was used to generate SPIA-amplified cDNA using the 
Ovation Pico WTA system (Nugen Technologies). Five micrograms of 
SPIA-amplified purified cDNA was then fragmented and biotin-labelled 
using the Nugen Encore Biotin module (Nugen Technologies). Microarray
analysis was performed using Affymetrix Mouse ST v.2.0 GeneChips
(Affymetrix), which consist of more than 28,000 probes for previously
annotated genes. The individual microarrays were washed and stained
in an Affymetrix Fluidics Station 450, and hybridized probe fluorescence
was detected using the Affymetrix G3000 GeneArray Scanner.
Image analysis was carried out on the Affymetrix GeneChip Command
Console v.2.0 using the MAS5 algorithm. CEL files were generated for
each array and used for gene-expression analysis. The CEL files were 
then processed in R (v.3.2.3) with the Bioconductor (v.3.2) libraries 
‘oligo’ (v.1.34.2), ‘pd.mogene.2.0.st’ (v.3.14.1) and ‘limma’ (v.3.26.8). 
We used robust multi-array average to perform background correction 
and normalization with the ‘rma’ function implemented in the ‘oligo’ 
package (‘target’ parameter was set to ‘core’ to obtain expression values 
at the gene level). The experimental design was stored as a single factor
with individual levels for each combination of LGR5-GFP level (high,
low or negative) or AQP5 status (positive or negative). Linear models
were fitted to the expression data with the function ‘lmFit’ (default
parameters). The relevant contrasts were fitted with ‘contrasts.fit’ 
(default parameters); differential expression was tested with ‘eBayes’ 
(default parameters). Differential gene expression was analysed using 
Partek Genomics Suite software (Partek). Relative gene expressions are 
depicted as single values as given by Partek analysis software. Gene set 
enrichment analysis was performed using GSEA v.6.1 (refs. 43, 44).


RNA-seq. AQP5+ and AQP5− cells were collected directly into RLT Plus
buffer by FACS. Total RNA was isolated using the Qiagen RNeasy
Micro Kit (Qiagen). RNA quality was first determined by assessing the
integrity of the 28S and 18S ribosomal RNA bands on Agilent RNA 6000
Pico LabChips in an Agilent 2100 Bioanalyzer (Agilent Technologies).
Amplified cDNA library was prepared according to the manufacturer’s 
instructions with SMARTer Stranded Total RNA-Seq Kit v.2 - Pico In- 
put Mammalian (Takara) using 10 ng of input total RNA. Indexed 150- 
bp paired-end sequencing was performed on HiSeq 2500 (Illumina) 
and Illumina real-time analysis software was used for base-calling to 
generate FASTQ files. The reads were mapped to Genome Reference 
Consortium Human Build 38 patch release 12 (GRCh38.p12) with STAR 
software version 2.5.3a with the following options issued:
--outFilterType BySJout, --outFilterMultimapNmax 10,
--alignSJoverhangMin 15, --alignSJDBoverhangMin 1,
--outFilterMismatchNmax 12, --outFilterMatchNminOverLread 0.4,
--alignIntronMin 20, --alignIntronMax 2000000, --outSAMattrIHstart 0,
--outSAMmapqUnique 244, --outMultimapperOrder Random,
--outReadsUnmapped None, --outFilterIntronMotifs None,
--outSAMmode Full, --outSAMattributes All, --quantMode GeneCounts,
--clip3pAdapterSeq
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT.
Counts per sample
were subsequently concatenated in the statistical software R (version 3.2.3),
and reads were normalized with trimmed mean of M values (TMM)
normalization as implemented in edgeR version 3.12.1 (with limma 3.26.9).
Differential expression testing was performed with the edgeR function
‘glmQLFit’ using a design matrix that took sample batches and AQP5
status into account. Differentially expressed genes were those with
more than twofold change between AQP5+ and AQP5− samples, with
false discovery rate < 0.05. Owing to the likely inclusion of immune
cells in the profile, immune-related genes (ref. 45) were omitted, resulting
in a final list of >500 differentially expressed genes. Gene Ontology,
overrepresentation analysis and PANTHER pathway analysis of the
differentially expressed genes were performed on the PANTHER
classification system (ref. 46) using default parameters.
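
The gene-selection step described above (more than twofold change, false discovery rate < 0.05, immune-related genes omitted) can be sketched as follows. This is only an illustration of the thresholding and not the edgeR workflow used in the study: it assumes that per-gene log2 fold changes and P values have already been obtained, it assumes the Benjamini-Hochberg procedure for the false discovery rate, and all function and parameter names are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def select_de_genes(genes, log2_fc, p_values, excluded=frozenset(),
                    min_fold=2.0, max_fdr=0.05):
    """Keep genes with |fold change| > min_fold and FDR < max_fdr.

    `excluded` stands in for a list of immune-related genes to omit.
    """
    fdr = benjamini_hochberg(p_values)
    threshold = math.log2(min_fold)
    return [
        gene
        for gene, change, q in zip(genes, log2_fc, fdr)
        if abs(change) > threshold and q < max_fdr and gene not in excluded
    ]
```

The fold-change filter is applied to the absolute log2 value, so both upregulated and downregulated genes pass when they exceed the threshold.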

Transcriptomic pathway signature analysis of human gastric cancer

The level 3 TCGA RNA-seq normalized matrix for 415 gastric cancer (GC)
and 35 normal gastric samples, and the corresponding clinical information,
were downloaded from the Broad Institute TCGA Genome Data Analysis
Center Firehose (https://gdac.broadinstitute.org/). Gene-expression
data of 200 GC and 100 matched normal gastric samples were generated
using the Affymetrix Human Genome U133 Plus 2.0 Array (GSE15459)
and processed as previously described. All normal samples, and only
tumours of antral or pyloric origin, were included for analysis. To
determine the activity of the PI3K, WNT and KRAS pathways in primary
tumours, we used pathway signatures published by several groups: a KRAS
signature based on differential gene-expression analysis between colorectal
cancers with high KRAS mutation and wild-type KRAS tumours; a PI3K
signature composed of genes modulated in vitro by PI3K inhibitors,
according to the CMap signature; and an intestinal WNT signature
defined by profiling colorectal cancer cell lines carrying an inducible
block of the WNT pathway and by differential gene-expression analysis of
human colon adenoma and adenocarcinomas versus normal colonic
epithelium. For each pathway signature, only genes upregulated on
pathway activation were selected for downstream analysis.

To quantify the relative activation level of a specific oncogenic pathway,
we derived a ‘μ score’ for each sample profile. In brief, the transcriptomic
μ score was defined as the average of the standardized expression
values of the genes upregulated in a specific oncogenic pathway
(log-transformed values were expressed in units of standard deviation
from the median across the samples included in the analysis). For each
oncogenic pathway, μ scores were calculated for all normal samples,
and the μ score at the 90th percentile of the normal samples was used
as the cut-off to define a pathway as hyperactivated. The μ score for
each tumour was then determined, and a tumour with a score higher than
the cut-off was considered to be hyperactivated for that particular
pathway. For each combination of pathways (WNT; WNT and PI3K; or WNT,
PI3K and KRAS), the concurrence rate was given by the frequency of
tumours that were hyperactivated in all the pathways in question.
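
The scoring scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: expression values are assumed to be log-transformed already, the sample standard deviation and a linearly interpolated percentile are assumed, and all function names are hypothetical.

```python
import math
import statistics


def pathway_scores(log_expr, signature):
    """Average standardized expression of signature genes, per sample.

    log_expr maps gene name -> list of log-transformed values (one per sample).
    Each gene is centred on its median and scaled by its standard deviation.
    """
    usable = [g for g in signature
              if g in log_expr and statistics.stdev(log_expr[g]) > 0]
    if not usable:
        raise ValueError("no signature genes with non-constant expression")
    n_samples = len(log_expr[usable[0]])
    totals = [0.0] * n_samples
    for gene in usable:
        values = log_expr[gene]
        median = statistics.median(values)
        sd = statistics.stdev(values)
        for i, value in enumerate(values):
            totals[i] += (value - median) / sd
    return [total / len(usable) for total in totals]


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def call_hyperactivated(scores, normal_idx, tumour_idx, q=90):
    """Flag tumours whose score exceeds the q-th percentile of normal samples."""
    cutoff = percentile([scores[i] for i in normal_idx], q)
    return [scores[i] > cutoff for i in tumour_idx]


def concurrence_rate(flags_by_pathway):
    """Fraction of tumours hyperactivated in every listed pathway."""
    per_tumour = list(zip(*flags_by_pathway))
    return sum(all(flags) for flags in per_tumour) / len(per_tumour)
```

Calling `concurrence_rate` with the flag lists for WNT alone, for WNT and PI3K, or for all three pathways would give the three frequencies for a cohort.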


Histology 

Immunohistochemistry and immunofluorescence. Immunohisto- 
chemistry (IHC) and immunofluorescence were performed according 
to standard protocols. In summary, tissues were fixed in 4%
paraformaldehyde in PBS (w/v) overnight at 4 °C, and processed into paraffin
blocks. Eight-micrometre sections from the paraffin blocks and tissue
microarray slides were deparaffinized and rehydrated, followed by
antigen retrieval via heating to 121 °C in a pressure cooker in standard
10 mM citric acid pH 6 buffer, a commercial citrate pH 6.1 buffer (S1699,
DAKO) or Tris/EDTA buffer, pH 9.0 (S2367, DAKO). Primary antibodies
used were chicken anti-EGFP (1:2,000, Abcam, ab290), rabbit anti-EGFP
(1:200, Cell Signalling, 2956S), rabbit anti-KI67 (1:200, Thermofisher,
MA5-14520), rabbit anti-GIF (1:10,000, provided by D.H. Alpers), rabbit
anti-RFP (1:200, Rockland, 600-401-379), rabbit anti-aquaporin 5 (1:200,
Santa Cruz, SC-28628 and 1:500, LSBio, LS-C756566), rabbit anti-SLC9A3
(1:200, Santa Cruz, SC-16103-R), rabbit anti-mucin 6 (1:200, LSBio,
LS-C312108), rabbit anti-A4GNT (1:500, Novus Biologicals, NBP1-89129),
rabbit anti-GAST (1:200, Leica Biosystems, NCL-GASp), mouse anti-MUC5AC
(1:200, Leica Biosystems, NCL-HGM-45-M1), rabbit anti-vimentin
(1:500, Abcam, ab92547), mouse anti-E-cadherin (1:200, BD Transduction
Laboratories, 610181), mouse anti-β-catenin (1:200, BD Transduction
Laboratories, 610154), mouse anti-RFP (1:200, Abcam, 129244), mouse
anti-CHGA (1:200, Abcam, 15160), rabbit anti-phospho-MAPK (1:200, Cell
Signalling, 4370S), mouse anti-H-K-ATPase (1:1,000, MBL International,
D032-3) and rabbit anti-phospho-AKT (1:200, Cell Signalling, 3787L).
Detailed information about clone number and antibody validation
can be found in Supplementary Table 6. The peroxidase-conjugated
secondary antibodies used were mouse or rabbit EnVision+ (DAKO) for
HRP IHC, or anti-chicken, -rabbit or -mouse Alexa 488, 568 or 647 IgG
(1:500, Invitrogen) for immunofluorescence. GSII-lectin-AF568 (1:500,
Thermofisher) was incubated on the slides for 1 h at room temperature
together with secondary antibodies. IHC sections were dehydrated,
cleared and mounted with DPX (Sigma) and immunofluorescence sections
were mounted in Hydromount (National Diagnostics) with Hoechst
for nuclear staining. Immunostainings and imaging were performed on
a minimum of three biological replicates and representative images of
the replicates are included in the manuscript.


H&E. H&E staining was performed on FFPE sections according to standard
laboratory protocols.


Whole mount analysis and vibratome sectioning. Tissues were fixed 
in 4% paraformaldehyde in PBS (w/v) overnight at 4 °C. Whole-mount 
tissues were permeabilized in 2% Triton X-100 in PBS (v/v) overnight
at 4 °C, and 500-μm vibratome sections were generated by sectioning
tissues embedded in 4% low-melting point agarose with a vibrating
microtome (Leica) and permeabilized in 2% Triton X-100 in PBS (v/v)
overnight at 4 °C. Rapiclear (Sunjin laboratory) was used to clear
whole-mount tissues and vibratome sections according to the manufacturer’s
instructions. Hoechst was used as a nuclear counterstain.


ISH. ISH and co-ISH were performed using the RNAscope (ref. 47) 2.5 High
Definition Brown Assay and 2.5 High Definition Duplex Reagent Assay
(Advanced Cell Diagnostics), respectively, according to the manufacturer’s
instructions. DapB was used as negative control for all the RNAscope
experiments. ISH and imaging were performed on a minimum of three
biological replicates and representative images of the replicates are
included in the manuscript.


Analysis and scoring of staining on mouse and human FFPE sections. 
Overlap of SLC9A3-eGFP and AQP5-antibody stains was determined by
counting stained cells contacting the lumen at the base of glands. The
entire height of the gland base surrounding the gland base lumen had to
be visible to avoid over- or underrepresentation of localized populations.

Samples of H&E sections of mouse gastric tumours were evaluated
by qualified veterinary and clinical histopathologists. Scoring of AQP5
staining on human gastric cancer specimens was performed by qualified
clinical histopathologists. The tumour in the tissue section was considered
positive for AQP5 if staining was observed in more than 5% of the
malignant cells. Subcellular localization of AQP5, relative staining
intensities and stained features were all determined by qualified pathologists.


Microscopy imaging 

Image acquisition. IHC and H&E slides were imaged with a Zeiss
AxioImager Z1 Upright microscope. RNAscope slides and large-area
images were captured with a Nikon Ni-E microscope and DS-Ri2 camera.
Immunofluorescence slides were imaged using Olympus FV1000 and 
FV3000 confocal microscopes. Cultured organoids were imaged with 
Olympus DP-27 camera on Olympus IX353 inverted microscope. 


Image processing. RNAscope and images with large areas were pro- 
cessed with NIS-Elements AR software (Nikon) with EDF and stitching 
features, respectively. Immunofluorescence images were processed 
using ImageJ (NIH) and whole-mount organoid images were processed 
using Imaris 8.0 (Bitplane). 


Statistics and reproducibility 

Gene-expression data were quantified and depicted as mean ± s.e.m.
Statistical analyses were performed using GraphPad Prism. Data were
tested for statistical significance by paired two-tailed t-test, unless
otherwise stated in figure legends. Statistical significance of overlap
between the two gene sets in Fig. 1b was determined by hypergeometric
distribution (http://nemates.org/MA/progs/overlap_stats.html). Precise
P values of statistical significance are shown in the respective figures.
All histological experiments and FACS strategies for which representative
images are shown were performed at least three times independently, with
similar results.
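
As an illustration of the gene-set overlap test mentioned above, the sketch below computes an upper-tail hypergeometric P value, that is, the probability of observing at least the given overlap when two gene sets are drawn at random from a common gene population. It is an assumption that this matches the calculation performed by the cited web tool; the function and parameter names are hypothetical.

```python
from math import comb


def overlap_p_value(population, size_a, size_b, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(population, size_a, size_b).

    population: total number of genes considered
    size_a, size_b: sizes of the two gene sets
    overlap: number of genes shared by both sets
    """
    largest_possible = min(size_a, size_b)
    total = comb(population, size_b)
    # comb() returns 0 when the second argument exceeds the first,
    # so impossible overlaps contribute nothing to the sum.
    tail = sum(
        comb(size_a, k) * comb(population - size_a, size_b - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding occurs in the final division.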


Reporting summary 


Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Microarray data that support the findings of this study have been 
deposited in the Gene Expression Omnibus (GEO) under accession code 
GSE121803. RNA-seq data of AQP5+ and AQP5− human samples have also
been deposited in the GEO, under accession code GSE133036. Source 
Data for Figs. 1-4 and Extended Data Figs. 1, 2, 4-6, 9 are provided with 
the paper. Any other relevant data supporting the findings of this study 
are available from the corresponding author on reasonable request. 


38. Madisen, L. et al. A robust and high-throughput Cre reporting and characterization 
system for the whole mouse brain. Nat. Neurosci. 13, 133-140 (2010). 

39. Jackson, E. L. et al. Analysis of lung tumor initiation and progression using conditional 
expression of oncogenic K-ras. Genes Dev. 15, 3243-3248 (2001). 

40. Shibata, H. et al. Rapid colorectal adenoma formation initiated by conditional targeting of 
the Apc gene. Science 278, 120-123 (1997). 

41. Suzuki, A. et al. T cell-specific loss of Pten leads to defects in central and peripheral 
tolerance. Immunity 14, 523-534 (2001). 

42. Leushacke, M. et al. Lgr5-expressing chief cells drive epithelial regeneration and cancer 
in the oxyntic stomach. Nat. Cell Biol. 19, 774-786 (2017). 

43. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for 
interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545- 
15550 (2005). 

44. Mootha, V. K. et al. PGC-1α-responsive genes involved in oxidative phosphorylation are
coordinately downregulated in human diabetes. Nat. Genet. 34, 267-273 (2003). 

45. Monaco, G. et al. RNA-seq signatures normalized by mRNA abundance allow absolute 
deconvolution of human immune cell types. Cell Rep. 26, 1627-1640.e7 (2019). 

46. Mi, H., Muruganujan, A., Casagrande, J. T. & Thomas, P. D. Large-scale gene function 
analysis with the PANTHER classification system. Nat. Protocols 8, 1551-1566 (2013). 

47. Wang, F. et al. RNAscope: a novel in situ RNA analysis platform for formalin-fixed, paraffin- 
embedded tissues. J. Mol. Diagn. 14, 22-29 (2012). 


Acknowledgements The authors thank staff at the IMB-IMU and the SBIC-Nikon Imaging 
Centre for imaging assistance; the research coordination team and Department of Pathology 
at NUH for assistance with human samples; S. Sagiraju for assistance with animal experiments; 
A. Lin and A. Ng for empirical candidate validation; K. Saito for assistance with RNA-seq 
preparation; M. Taniguchi and K. Kita for assistance with FACS; D. H. Alpers for providing the 
GIF antibody; A. van Oudenaarden and A. Lyubimova for assistance with CEL-seq and RaceID;
and F. de Sauvage for providing the Lgr5-DTR-eGFP mice. N.B. is supported by the Agency for
Science, Technology and Research (A*Star), Singapore Gastric Cancer Consortium (SGCC)
and Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 17H01399. This
research is supported by Singapore Ministry of Health’s National Medical Research Council
under its Open Fund-Young Individual Research Grant (NMRC/OFYIRG/0007/2016) and the
National Research Foundation Singapore (Investigatorship Program award no.
NRF-NRFI2017-03).


Author contributions S.H.T. and Y.S. contributed to all aspects of the study: they designed,
performed all empirical experiments, collected and analysed data, and wrote the manuscript. 
S.1. designed and performed experiments, and collected and analysed data for profiling and 
validation studies for candidate-marker identification. J.G. performed immunostaining, CEL- 
seq experiments and mouse husbandry. R.S. provided advice and technical help with human 
and mouse cancer, analysed data and wrote the manuscript. K.M. performed FACS and 
immunostaining for human AQP5 FACS experiments. P.P. performed immunostaining and 
mouse husbandry. L.-T. performed mouse husbandry. E.W. generated the transgenic mouse
lines. T.S. and S.W-.H. analysed human cancer data in pathway analysis. S.L.I.J.D. analysed
microarray, CEL-seq and RNA-seq data. S.M. performed FACS experiments. A.F. provided
advice and technical help with human experiments and mouse cancer models. M.O., T.T.,
H.I.G., S.S., M.T., K.G.Y., J.S. and A.S. provided patient samples. H.I.G., S.S. and M.T. analysed and
scored stained patient samples. P.T. designed and supervised cancer frequency analysis. N.B. 
supervised the project, analysed the data and wrote the manuscript. All authors discussed 
results and edited the manuscript. 


Competing interests N.B. and S.H.T. are co-inventors on the provisional patent application
10201911742W titled ‘A method for functional classification and diagnosis of cancers’. This 
patent covers the analysis of human cancers using their signalling pathway statuses. All the 
other authors declare no competing interests. 


Additional information 

Supplementary information is available for this paper at
https://doi.org/10.1038/s41586-020-1973-x.

Correspondence and requests for materials should be addressed to N.B. 

Reprints and permissions information is available at http://www.nature.com/reprints. 


[Extended Data Fig. 1 graphic: pylorus and small intestine panels; RNA in situ
hybridization and relative expression levels of A4gnt, Gif, Muc6, Slc9a3 and Spp1.]


Extended Data Fig. 1 | Comparative profiling of LGR5 populations in
gastrointestinal tissues identifies new pyloric-specific markers. a-c, FACS
strategy sorting eGFP^high and eGFP^low cells from Lgr5-eGFP-IRES-creERT2
pylorus (a), small intestine (b) and colon (c). d, Lgr5 expression (by qPCR) in
sorted populations and unsorted tissues of the gastrointestinal tract. Data are
represented as mean ± s.e.m. n=4 biological replicates; one-way ANOVA.
e, AQP5 protein expression in the mouse stomach through to the duodenum by
immunostaining. n=3 biological replicates. f-i, Aqp5 mRNA expression in the
corpus (f), Brunner’s glands (g), small intestine (h) and colon (i) by ISH.
n=3 biological replicates. j-n, A4gnt (j), Gif (k), Muc6 (l), Slc9a3 (m) and Spp1
(n) expression in the corpus, Brunner’s glands, small intestine and colon by
qPCR, ISH and co-ISH with Lgr5. For histology experiments, n=3; for qPCR,
n=4 biological replicates for Gif, Muc6 and Slc9a3 qPCRs, for which data are
represented as mean ± s.e.m.; n=2 technical replicates from a pooled sample of
8 for A4gnt and Spp1 qPCRs. Scale bars, 500 μm (e), 20 μm (f-n).


Extended Data Fig. 2 | AQP5 marks the major subpopulation of LGR5^high
pyloric stem cells. a, Exon 1 knock-in gene strategy to generate EGFP-CreERT2
reporters of Aqp5, A4gnt, Slc9a3 and Spp1 expression. b, The 3’ UTR knock-in
gene strategy to generate 2A-eGFP, 2A-CreERT2 or 2A-DTR reporters of Aqp5,
Slc9a3 and Lgr5 expression. c-f, eGFP signal in the pylorus and small intestine
of Aqp5-2A-eGFP (c, d) and Slc9a3-2A-eGFP mice (e, f). g, h, Quantification of the
overlap between eGFP+ cells and AQP5+ cells in Slc9a3-2A-eGFP pylori (g)
(n=102 glands from 3 mice) and a representative image of the immunostaining
(h). Results are presented as mean ± s.e.m. i, j, Quantification of overlap
between LGR5-eGFP+ cells with AQP5+ cells (i) (n=117 glands from 4 mice) and a
representative image of the immunostaining (j). Results are presented as
mean ± s.e.m. k-m, Colocalization of LGR5-eGFP in the pylorus with GIF (k),
GAST (l) and CHGA (m). n=3 biological replicates. n, t-distributed stochastic
neighbour embedding (t-SNE) map of single LGR5-eGFP^high cells from the
pylorus. n=285 cells from 3 mice. o-s, t-SNE maps showing enrichment of
candidate markers in major (o-q) and minor (r, s) subpopulations of LGR5^high
pyloric cells. n=285 cells from 3 mice. t, Frequency of 10 published
proliferation markers (Bcl2, Ccnd1, Ckap2, Foxm1, Ki67, Mcm2, Mybl2, Plk1,
Rrm2 and Top2a) in major versus minor subpopulations, compared by two-tailed
Mann-Whitney test. n=285 cells from 3 mice, 248 cells in major, 8 cells in
minor-1 and 29 cells in minor-2 populations. u, v, Aqp5 (u) and Lgr5 (v)
expression in cells sorted from Aqp5-eGFP-IRES-creERT2 pylori. Mean ± s.e.m.,
n=4 biological replicates. w, x, Co-immunostaining for eGFP driven by
Aqp5-eGFP-IRES-creERT2 and endogenous AQP5 (w) and KI67 (x). n=3 mice. Scale
bars, 50 μm (g, h), 25 μm (k, m, w, x).


Extended Data Fig. 3 | AQP5 and other newly identified pyloric markers
label pyloric stem cells, but not other gastrointestinal stem cells, in vivo.
a-p, Lineage tracing in A4gnt-, Aqp5-, Spp1- and Slc9a3-eGFP-IRES-creERT2 mice
crossed with tdTomato reporter mice after a short trace (20-48 h) in the
pylorus (a-d) and small intestine (i-l), and a long trace (>3 months) in the
pylorus (e-h) and small intestine (m-p). n=3 mice per genotype. q-x, Lineage
tracing in pylorus (q-t) and small intestine (u-x) of the Aqp5-2A-creERT2 and
Slc9a3-2A-creERT2 mice after a short trace (q, r, u, v) and a long trace (s, t, w, x).
y, z, a’, b’, Whole-mount imaging of pylorus from induced
Aqp5-eGFP-IRES-creERT2;tdTomato mice 20 h (y) and 6 months (z) after induction.
Whole-mount imaging of pylorus from uninduced 8-week-old (a’) and 8-month-old
(b’) Aqp5-eGFP-IRES-creERT2;tdTomato mice. n=3 mice per condition.
tdTomato (tdTom) signal through the entire height of the pyloric epithelium is
shown, and DAPI from the upper parts of pyloric glands is depicted for clarity.
c’, d’, e’, tdTomato signal in clusters of glands in 1-year-traced pylorus (c’) and
small intestine (d’, e’). f’, g’, h’, tdTomato expression in gastric corpus (f’), colon
(g’) and Brunner’s glands (h’) 20 h and 6 months after induction. n=3 biological
replicates. Scale bars, 50 μm.
Extended Data Fig. 4 | Detailed characterization of AQP5-expressing pyloric
cells by transcriptomic profiling, in vivo ablation and ex vivo organoid
culture. a, AQP5+ gating strategy with cells from wild-type pylorus stained only
with propidium iodide. b, Heat map of transcriptomes from AQP5+ and AQP5−
cells. n=4 biological replicates. c, Gene set enrichment analysis comparing the
degree of overlap between transcriptomes of AQP5+ cells and LGR5+ cells from
the pylorus using Kolmogorov-Smirnov statistic. n=4 biological replicates
each. d, e, Relative Aqp5 (d) and Lgr5 (e) expression (from microarray) in AQP5+
and AQP5− cells (n=4 biological replicates), by one-way ANOVA in the Partek
analysis software. f-h, Relative expression of various pyloric markers (f), other
published pyloric stem cell markers (g) and lineage and proliferation markers
(h) in AQP5+ population versus AQP5− population in microarray. n=4 biological
replicates. Data are represented as mean, as derived from Partek analysis
software by one-way ANOVA. i, j, AQP5 staining in a whole-mount organoid (i)
and AQP5 colocalization with KI67 (j) in an organoid section. Organoids were
derived from single AQP5+ cells. n=3 biological replicates. k-r, Pylori of
diphtheria-toxin-treated wild-type (k-n) and Aqp5-2A-DTR (o-r) mice stained
for H&E (k, o), E-cadherin (l, p), GIF (m, q) and GAST (n, r). n=3 biological
replicates. s-u, Outgrowth efficiency of AQP5+, AQP5−, LGR5-eGFP+ and
LGR5-eGFP− cells (s). n=5 biological replicates for AQP5+ and AQP5− cells,
n=3 biological replicates for eGFP+ and eGFP− cells from Lgr5-2A-eGFP pylori.
Representative images of organoids derived from eGFP+ (t) and eGFP− (u) cells
from Lgr5-2A-eGFP pylori. Paired two-sided t-test. Scale bars, 25 μm (i, j), 50 μm
(k-r), 500 μm (t, u).
Extended Data Fig. 5 | AQP5 is expressed at human pyloric gland bases
together with other pyloric markers, and facilitates the isolation of human
pyloric stem cells. a-f, a’, b’, c’, d’, e’, f’, MUC6 (a, b, a’, b’), A4GNT (c, d, c’, d’) and
SLC9A3 (e, f, e’, f’) expression (co-ISH with LGR5 and immunostaining) in normal
human pylorus. n=3 biological replicates. g-j, Co-ISH to colocalize AQP5 with
PEPC (g), MUC6 (h), GIF (i) and MUC5AC (j) pyloric lineage markers.
n=3 biological replicates. k, l, AQP5 labelling in whole-mount human
organoids (k) and AQP5 colocalization with KI67 (l) in organoid sections.
n=3 biological replicates. m-o, Relative AXIN2 (m), TFF2 (n) and AQP5 (o)
expression in AQP5+ cell-derived organoids three days after WNT3A, Noggin
and FGF10 withdrawal, by qPCR. n=3 biological replicates. Scale bars, 100 μm
(a-j, l), 25 μm (a’, b’, c’, d’, e’, f’, k).
Extended Data Fig. 6 | Profiling and validation of the transcriptome of the
human pyloric AQP5-expressing population. a-e, qPCR validation (green)
(n=5 biological replicates) and RNA-seq values (blue) (n=8 biological
replicates) of homologues of mouse stem cell markers (a), membrane
components (b), chemokine signalling components (c), extracellular matrix
components (d) and other genes (e). Two-sided Mann-Whitney test was used to
determine statistical significance of qPCR result differences for all genes
except AQP5, which was determined by two-tailed paired t-test. qPCR and
RNA-seq results are presented as mean. f, f’, f’’, ISH of SMOC2 (brown) on
normal human pylorus. f’ is a magnified inset of surface mucosa, and f’’ is a
magnified inset of gland base. n=4 biological replicates. Scale bars, 100 μm
(f), 10 μm (f’, f’’). g, Top 10 Panther pathways enriched with the most candidate
genes.
a, Cancer Genome Atlas Research Network (2014)
Hyperactivated pathways    Frequency
Wnt                        82.6%
Wnt; PI3K                  61.3%
Wnt; PI3K; Kras            59.4%

b, Muratani et al. (2014), GSE15459
Hyperactivated pathways    Frequency
Wnt                        81.0%
Wnt; PI3K                  64.3%
Wnt; PI3K; Kras            57.1%
Extended Data Fig. 7 | WNT, PI3K and KRAS pathways are commonly
co-dysregulated in human distal gastric cancers. a, b, Co-hyperactivation status
of the WNT, PI3K and KRAS pathways in human distal gastric cancer samples
from TCGA (a) (n=155) and GSE15459 (b) (n=42) datasets. Heat maps show
distribution of pathway hyperactivation status across samples. Graphs depict
distribution of z scores (degree of signalling activity) of normal and tumour
samples for each of the pathways we examined.


a
                              APC (A)              APC;PTEN (AP)        APC;PTEN;Kras (APK)
IRES-CreERT2 (Aqp5, A4gnt)    n=4/5, 2 invasive    n=5/7                n=10/11, 1 invasive
                              7-11 mths            4.5-6 mths           2-4.5 mths
2A-CreERT2 (Aqp5, Slc9a3)     n=3/3, 3 invasive    n=6/6, 5 invasive    n=3/3
                              3.5-8 mths           2-3.5 mths           1-1.5 mths

Panel labels: IRES-CreERT2; 2A-CreERT2
Extended Data Fig. 8 | Targeted conditional mutation of pyloric stem cells
using our mouse models selectively drives tumour formation in the distal
stomach. a, Sample sizes, tumour and invasion incidences observed in various
permutations of creERT2 drivers and oncogenic alleles. b-g, Whole-mount and
H&E images of entire pyloric regions for each creERT2-oncogenic-allele
combination. h-l, H&E images of pylori from multiple-pyloric-marker-creERT2;Kras
models (h, i), APK-only model (without creERT2 driver) (j),
and small intestine (k) and colon (l) from Aqp5-IRES-creERT2 APK mouse model
of gastric cancer. Scale bars, 1 cm (whole-mount insets in b-g), 200 μm
(H&E images in b-l).
Extended Data Fig. 9 | Phenotypic characterization of Aqp5-IRES-creERT2
APK and Slc9a3-2A-creERT2 AP distal stomach tumours.
a-g, Immunostaining of various markers in Aqp5-IRES-creERT2 APK pyloric
tumour. h, h’, h’’, Co-ISH of AQP5 and LGR5 in tumour region; region in black
box in h is magnified in h’. Dual ISH of AQP5 and LGR5 in an adjacent normal
pyloric region from the same mouse (h’’). i, i’, Representative H&E stain
of a salivary gland tumour from Aqp5-IRES-creERT2 APK mouse.
j-o, Immunostaining of various markers in Slc9a3-2A-creERT2 AP pyloric
tumour. p-w, Immunostaining of various markers in the APK-only (no creERT2)
FACS gating for eGFP+ cells using normal Aqp5-eGFP-IRES-creERT2 (y) and
wild-type (z) pylori. n=3 biological replicates. y, z, z’, z’’, Organoid assay
for stemness of AQP5-eGFP+ tumour cells. n=3 biological replicates.
y, Experimental timeline. FACS gating strategy to isolate eGFP+ tumour cells
from Aqp5-creERT2; APK pyloric tumour (z), eGFP+ cells from normal
Aqp5-eGFP-IRES-creERT2 pylorus (z’), and control GFP gating with wild-type pylorus
(z’’). Scale bars, 100 μm (a-g, i-x, h’, i’, x’), 20 μm (h’’).


a
              Total cores   AQP5+ cores     Cellular localization (% of AQP5+ cores)
Subtype       #             #       %       Cytoplasm   Membrane   Nucleus   Mixed*
Intestinal    77            56      72.7    96.4        42.9       1.8       39.3
Diffuse       44            31      70.5    93.5        29.0       6.5       32.3
Mixed         24            16      66.7    100.0       37.5       6.3       43.8
Overall       145           103     71.0    96.1        37.9       3.9       37.9
Extended Data Fig. 10 | AQP5 expression is commonly dysregulated in
human distal gastric cancers. a, Summary of AQP5 expression in a tumour
microarray panel of 145 cores of human distal gastric cancer. AQP5 expression
is scored as positive if observed in >5% of malignant cells. b-e, Examples of
AQP5+ cores with intestinal (b, c) and diffuse (d, e) subtypes, often with
cytoplasmic and/or membranous staining. f, Summary of AQP5 expression
from 54 full sections of distal human gastric cancer. *Mixed refers to AQP5
localization in cytoplasm and nucleus, or cytoplasm and membrane. g-n, AQP5
expression in intestinal (g, h, k, l) and diffuse (i, j, m, n) subtypes. Yellow
arrowheads indicate cells co-expressing AQP5 and KI67. o, Summary of other
observations of AQP5 expression in full gastric tumour sections. p-s, AQP5
expression in the invasive edge of the tumour (p), intestinal metaplasia (IM) (q)
(dotted lines denote intestinal metaplasia region that is negative for AQP5),
signet ring cells (r) (black arrows denote cells with weak AQP5 expression) and
tumour cells in lymph node metastasis (s). Scale bars, 20 μm (g-n, q, r), 50 μm
(b-e, p, s).
Statistical parameters


When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main 
text, or Methods section). 


n/a | Confirmed 
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons


A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND 
variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings

Software and code


Policy information about availability of computer code 


Data collection: BD FACS Sortware sorter software (v1.1), QuantStudio Real-Time PCR Software (v1.3), Nikon NIS-Elements AR (v5.11), Zeiss ZEN Blue (v2),
Olympus FV31-SW (v2.3), Olympus DP2-BSW (v2.2)


Data analysis: Software: Partek Genomics Suite, GraphPad Prism (v5.03 and v8.1), ImageJ (v1.52a), Bitplane (v8.0), Nikon NIS-Elements AR (v5.11),
QuantStudio Real-Time PCR Software (v1.3)
Open source code: RaceID (https://github.com/dgrun/RaceID), edgeR version 3.12.1 (with limma_3.26.9), PANTHER (v14.1), GSEA (v6.1)


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers 
upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 
Data

All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


The datasets generated during and/or analysed during the current study are available in the GEO repository under accession codes GSE121803 and GSE133036. The 
datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request 


Field-specific reporting 


Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


[x] Life sciences [ ] Behavioural & social sciences [ ] Ecological, evolutionary & environmental sciences

Life sciences study design


All studies must disclose on these points even when the disclosure is negative. 

Sample size: No sample size calculation was performed. All qualitative experiments (e.g. immunostaining) were done with at least three biological
replicates with all showing the same outcome, indicating applicability to the broader sample. We included as many independent replicates as
possible (3 or more) in quantitative experiments, and used statistical tests to determine if any observable difference was statistically
significant. Sample size is deemed sufficient with a statistically significant difference.

Data exclusions: In principle, data were only excluded for failed experiments resulting from technical error.

Replication: We performed experiments on at least 3 biological replicates to ensure reproducibility of results.

Randomization: No randomization of mice. Mice analyzed were littermates and sex-matched whenever possible.

Blinding: Investigators were not blinded to mouse genotypes and patient conditions during experiments. Data reported for mouse experiments are not
subjective but based on experimental observations. Investigators needed knowledge of patients’ conditions (healthy or cancerous) to
ascertain applicability of human tissues collected for downstream experiments.
Obtaining unique materials: Unique mouse models are available from the corresponding author on request.

Antibodies used:

chicken anti-EGFP (1:2,000; Abcam, ab290), rabbit anti-EGFP (1:200; Cell Signalling, 2956S), rabbit anti-Ki67 (1:200;
Thermofisher, MA5-14520), rabbit anti-GIF (1:10,000; provided by D. H. Alpers, Washington University School of Medicine, USA),
rabbit anti-RFP (1:200; Rockland, 600-401-379), rabbit anti-aquaporin5 (1:200; Santa Cruz, SC-28628), rabbit anti-Slc9a3 (1:200;
Santa Cruz, SC-16103-R), rabbit anti-mucin6 (1:200; LsBio, LS-C312108), rabbit anti-A4gnt (1:500; Novus Biologicals,
NBP1-89129), rabbit anti-Gastrin (1:200; Novocastra, NCL-GASp), mouse anti-MUC5AC (1:200; Novocastra, NCL-HGM-45-M1),
rabbit anti-vimentin (1:500; Abcam, ab92547), mouse anti-E-cadherin (1:200; BD Transduction Laboratories, 610181), mouse
anti-β-catenin (1:200; BD Transduction Laboratories, 610154), mouse anti-RFP (1:200; Abcam, 129244), mouse anti-ChgA (1:200;
Abcam, 15160), rabbit anti-Phospho-MAPK (1:200; Cell Signalling, 4370S), mouse anti-H-K-ATPase (1:1,000; MBL International
Corporation, DO32-3), rabbit anti-Phospho-Akt (1:200; Cell Signalling, 3787L)


Validation The antibodies were validated by the relevant companies and show expected staining patterns and cellular localization in our 
experiments. The requested details have been summarised in Supplementary Table 6. 
Laboratory animals: Adult B6 mice (7 weeks and older) of both sexes were used in the study. All animal experiments were approved by the
Institutional Animal Care and Use Committee of Singapore.


Wild animals: The study did not involve wild animals.


Field-collected samples: The study did not involve samples collected from the field.


Human research participants 


Policy information about studies involving human research participants 


Population characteristics: Healthy gastric mucosa or gastric adenocarcinomas from patients’ pyloric antrum were collected for the study. State of health
was determined by a clinical pathologist. Healthy mucosa was collected in Advanced DMEM/F12 media for processing in lab,
while gastric adenocarcinomas were collected in 4% paraformaldehyde for FFPE processing.


Recruitment: Participants were recruited by the National University Hospital research coordination team, and provided by National University
Singapore Department of Medicine and Pathology (granted under protocol-11-167E). Informed consent was obtained from all
patients.


Flow Cytometry 


Plots 


Confirm that: 


The axis labels state the marker and fluorochrome used (e.g. CD4-FITC). 


The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a 'group' is an analysis of identical markers). 


All plots are contour plots with outliers or pseudocolor plots. 
Methodology


Sample preparation: Murine pylorus was incubated in chelation buffer (5.6 mM sodium phosphate, 8 mM potassium phosphate, 96.2 mM sodium
chloride, 1.6 mM potassium chloride, 43.4 mM sucrose, 54.9 mM D-sorbitol, 1 mM dithiothreitol) with 5 mM EDTA at 4 °C for 2
hours. Glands were isolated by repeated pipetting of finely chopped pylorus tissue in cold chelation buffer. Chelation buffer
containing isolated glands was filtered through 100 µm filter mesh, and centrifuged at 720g at 4 °C for 3 min. The pellet was
resuspended in TrypLE (Life Technologies) with DNase I (0.8 U/µl) (Sigma) and incubated at 37 °C for 10 min with intermittent
trituration for digestion into single cells. Digestion was quenched by dilution with cold HBSS buffer. The suspension was
centrifuged at 720g at 4 °C for 3 min. For anti-Aqp5 antibody stain, the pellet was resuspended in HBSS with 2% fetal bovine
serum (FBS, Hyclone) with anti-Aqp5-AF647 (Abcam) at 1:500 dilution and incubated on ice for 30 min in the dark. The pellet was
subsequently washed twice with cold HBSS and spun at 800g for 3 min at 4 °C. The pellet was resuspended in HBSS with 2% fetal
bovine serum (FBS, Hyclone). Before sorting, 1 µg/ml propidium iodide (Life Technologies) was added to the cell suspensions,
which were then filtered through a 40 µm strainer and sorted on a BD Influx Cell Sorter (BD Biosciences).


Instrument: BD Influx
Software: BD FACSDiva for collection and population analysis
Cell population abundance: Sorts with GFP-reporter mice required one pylorus per sample. Sorts using antibodies required four pylori. Depending on the
marker, the yield ranged from 1,000 to 3,000 positive cells per experiment. Negative population is always in abundance. Cell
populations collected were subsequently confirmed by qPCR.


Gating strategy: Positive/Negative gating strategy was defined with wildtype or unstained cells, and subsequently confirmed by qPCR.


Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information. 
Metformin, the world’s most prescribed anti-diabetic drug, is also effective in
preventing type 2 diabetes in people at high risk. More than 60% of this effect is
attributable to the ability of metformin to lower body weight in a sustained manner.
The molecular mechanisms by which metformin lowers body weight are unknown.
Here we show, in two independent randomized controlled clinical trials, that
metformin increases circulating levels of the peptide hormone growth/differentiation
factor 15 (GDF15), which has been shown to reduce food intake and lower body weight
through a brain-stem-restricted receptor. In wild-type mice, oral metformin increased
circulating GDF15, with GDF15 expression increasing predominantly in the distal
intestine and the kidney. Metformin prevented weight gain in response to a high-fat
diet in wild-type mice but not in mice lacking GDF15 or its receptor GDNF family
receptor α-like (GFRAL). In obese mice on a high-fat diet, the effects of metformin to
reduce body weight were reversed by a GFRAL-antagonist antibody. Metformin had
effects on both energy intake and energy expenditure that were dependent on GDF15,
but retained its ability to lower circulating glucose levels in the absence of GDF15
activity. In summary, metformin elevates circulating levels of GDF15, which is
necessary to obtain its beneficial effects on energy balance and body weight, major
contributors to its action as a chemopreventive agent.


Metformin has been used as a treatment for type 2 diabetes since the
1950s. Recent studies have shown that it can also prevent or delay the
onset of type 2 diabetes in people at high risk. At-risk individuals
treated with metformin exhibit a reduction in body weight, glucose
and insulin levels and enhanced insulin sensitivity. Although many
mechanisms for the insulin-sensitizing actions of metformin have
been proposed, they do not explain the weight loss. The robustness
and persistence of metformin-induced weight loss in participants in
the Diabetes Prevention Program have drawn attention to its importance
to the chemopreventive effects of the drug. A recent observational
epidemiological study noted a strong association of metformin
use with circulating levels of GDF15, a peptide hormone produced by
cells responding to stressors. GDF15 acts through a receptor complex
that is expressed solely in the hindbrain, through which it suppresses
food intake. We hypothesized that the effects of metformin in
lowering body weight may involve the elevation of circulating levels
of GDF15.


Human studies 

We first measured circulating GDF15 in a short-term human study
and found that, after two weeks of metformin treatment, there was
an increase of about 2.5-fold in mean circulating GDF15 (Fig. 1a).
To determine whether this increase was sustained, we measured circulating
GDF15 levels at 6, 12 and 18 months in all available participants in
Carotid Atherosclerosis: Metformin for Insulin Resistance (CAMERA),
an 18-month randomized placebo-controlled trial of metformin in people
without diabetes but with a history of cardiovascular disease. In this
study, metformin-treated participants lost about 3.5% of body weight
with no significant change in weight in the placebo arm. Metformin
Fig. 1 | Effect of metformin on circulating GDF15 levels in humans and mice.
a, Paired serum GDF15 concentration in nine human subjects after two weeks of
either placebo or metformin treatment, P value (95% CI) by two-tailed t-test.
b, Plasma GDF15 concentration in overweight or obese non-diabetic
participants with known cardiovascular disease, randomized to metformin or
placebo in CAMERA, using a mixed linear model. Data are mean ± s.e.m. Subject
numbers: placebo and metformin, respectively, at time points: baseline, n = 85
and 86; 6 months, n = 81 and 71; 12 months, n = 77 and 68; 18 months, n = 83 and
74. Comparing metformin vs placebo groups, two-sided P = 0.311 at baseline,
and P < 0.0001 at 6, 12 and 18 months individually. c, Serum GDF15 levels
(mean ± s.e.m.) in obese mice measured 2, 4, 8 or 24 h after a single oral dose of
300 mg kg−1 or 600 mg kg−1 metformin, n = 7 per group, P values by two-way
ANOVA with Tukey’s correction for multiple comparisons.


treatment was associated with significantly (P < 0.0001) increased levels
of circulating GDF15 at all three time points (Fig. 1b, Extended Data
Fig. 1b–e). Furthermore, the change in serum GDF15 from baseline in
metformin recipients was significantly correlated (R = −0.26, P = 0.024)
with weight loss (Extended Data Fig. 1a).

The correlation of GDF15 increment with changes in body weight,
while statistically significant, was modest in size. Although we believe
that the increase in GDF15 does contribute to weight loss in some
individuals taking metformin, we acknowledge that it is not necessary,
and there are individuals with increases in GDF15 who do not exhibit
weight loss. However, in the context of a long-term human study with
imperfect drug compliance and intermittent sampling of GDF15 levels,
it is noteworthy that such an association was seen at all. Further,
there was no association of weight change with change in GDF15 in the
placebo group (R = −0.04, P = 0.740, n = 81).
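
To make "modest in size" concrete, the short Python sketch below converts the reported coefficients into a share of variance explained and, for the placebo group where n is given, into a t statistic. It assumes that R denotes a Pearson correlation coefficient, which the text does not state, and the function names are illustrative only.

```python
import math


def variance_explained(r):
    """Share of variance accounted for by a correlation coefficient (r squared)."""
    return r * r


def correlation_t_statistic(r, n):
    """t statistic for testing a Pearson correlation against zero (n - 2 d.f.)."""
    if n < 3:
        raise ValueError("need at least three observations")
    if abs(r) >= 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Metformin group: R = -0.26 (sample size not restated in this passage).
metformin_r2 = variance_explained(-0.26)

# Placebo group: R = -0.04 with n = 81.
placebo_t = correlation_t_statistic(-0.04, 81)

print(f"variance explained (metformin): {metformin_r2:.3f}")
print(f"t statistic (placebo): {placebo_t:.3f}")
```

Under that assumption, R = −0.26 corresponds to roughly 7% of the variance in weight change, which is in line with the authors' description of the association as modest.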


Mouse studies 

Following these findings in humans, we performed a series of animal
experiments to determine the potential causal link between the changes
in GDF15 and weight changes induced by metformin. We administered
metformin by oral gavage to mice fed a high-fat diet and measured
serum GDF15. A single dose of 300 mg kg−1 of metformin increased
GDF15 levels for at least 8 h (Fig. 1c). A higher dose of metformin,
600 mg kg−1, resulted in a sixfold increase in serum GDF15 levels at 4 h
and 8 h after the dose, which were sustained above those of vehicle-treated
mice for 24 h. The effects of metformin in chow-fed mice were
Fig. 2 | GDF15–GFRAL signalling is required for the weight-loss effects of
metformin on a high-fat diet. a, Percentage change in body weight of Gdf15+/+
and Gdf15−/− mice on a high-fat diet treated with metformin (300 mg kg−1 day−1)
for 11 days. Data are mean ± s.e.m., n = 6 per group except Gdf15+/+ vehicle, n = 7;
P values by two-way ANOVA with Tukey’s correction for multiple comparisons.
b, Cumulative food intake of mice as in a, P values by two-way ANOVA with
Tukey’s correction for multiple comparisons. c, Percentage change in body
weight of Gfral+/+ and Gfral−/− mice on a high-fat diet treated with metformin
(300 mg kg−1 day−1) for 11 days. Data are mean ± s.e.m., n = 6 per group; P values
by two-way ANOVA with Tukey’s correction for multiple comparisons.
d, Percentage change in body weight of metformin-treated obese mice dosed
with an anti-GFRAL antagonist antibody weekly for five weeks (yellow), starting
four weeks after initial metformin exposure (grey). Data are mean ± s.e.m.,
vehicle + control IgG and metformin + anti-GFRAL, n = 7; other groups, n = 8;
P values by two-way ANOVA with Tukey’s correction for multiple comparisons.
Calo, period in which energy expenditure was measured (see e); arrow, start of oral
GTT (Fig. 3e–h). e, ANCOVA of energy expenditure against body weight of mice
treated as in d. n = 6 mice per group. Data points show individual mice; P values
for metformin calculated using ANCOVA with body weight as a covariate and
treatment as a fixed factor.


less pronounced (Extended Data Fig. 2), suggesting an interaction 
between metformin and the high-fat diet. 

To determine the extent to which the metformin-induced increase in
GDF15 affects body weight, Gdf15+/+ and Gdf15−/− mice were switched
from chow to a high-fat diet and dosed with metformin for 11 days. The
high-fat diet induced similar weight gain in both genotypes (Fig. 2a).
Metformin completely prevented weight gain in Gdf15+/+ mice, but
Gdf15−/− mice were insensitive to the weight-reducing effects of metformin
(Fig. 2a, Extended Data Fig. 3a). Metformin significantly reduced
cumulative food intake in wild-type mice but this effect was abolished
in Gdf15−/− mice (Fig. 2b).

The identical protocol was applied to mice lacking GFRAL, the 
ligand-binding component of the hindbrain-expressed GDF15 receptor 
complex. Consistent with the results in mice lacking GDF15, metformin 
was unable to prevent weight gain in Gfral−/− mice (Fig. 2c, Extended
Data Fig. 3b), despite similar levels of serum GDF15 to wild-type mice 
Fig. 3 | Effects of metformin on glucose homeostasis. a, ITT (0.5 U kg−1 insulin)
after 11 days of metformin treatment (300 mg kg−1) in Gdf15+/+ and Gdf15−/− mice
ona high-fat diet. Data are mean+s.e.m.,n=6 per group, except Gdf15”~ 
vehicle, n=7 and Gdf15‘" vehicle, n=5.b, AUC analysis of glucose over time in 
mice from a. Data are mean ± s.e.m.; P values by two-way ANOVA; interaction of
genotype and metformin, P = 0.037. c, Fasting glucose (time 0) of ITT from a.
Data are mean ± s.e.m.; P values by two-way ANOVA, effect of genotype,
P = 0.144; interaction of genotype and metformin, P = 0.988. d, Fasting insulin
(time 0) in ITT from a. Data are mean ± s.e.m.; P values by two-way ANOVA,
effect of genotype, P = 0.131; interaction of genotype and metformin, P = 0.056.
e, f, Glucose over time after oral GTT in metformin-treated obese mice given
either IgG (e) or anti-GFRAL (f) once weekly for five weeks (as in Fig. 2d).


(Extended Data Fig. 4a, b). In this experiment, the reduction in cumulative
food intake did not reach statistical significance (Extended Data
Fig. 4c).

To investigate the contribution of GDF15–GFRAL signalling
to sustained, metformin-dependent weight regulation, we
performed a 9-week study in which mice received approximately
250–300 mg kg−1 day−1 of metformin incorporated into their high-fat
diet. The mice lost around 9% of their body weight after 1 month on this
diet (Fig. 2d, Extended Data Fig. 3c). At this time, an anti-GFRAL antagonist
antibody or IgG control was administered. Metformin-consuming
mice treated with anti-GFRAL regained about 12% of body weight after
5 weeks, whereas the weight loss seen in IgG control-treated mice was
maintained, reaching approximately 7% below the starting weight
(Fig. 2d). The significant reduction in fat mass seen with metformin
treatment and control antibody was not seen in the anti-GFRAL group
(Extended Data Fig. 4d). The delivery of metformin in chow resulted
in an initial reduction in food intake in all metformin-treated groups, 
presumably because of a taste effect. This reduction in food intake 
will have affected metformin levels and probably affected GDF15 
levels, with potential to bias the results. However, it is reassuring 
to note that any persistence of this would have worked against the 
detection of a specific effect of GFRAL antagonism, which was clearly 
demonstrable. 

We undertook indirect calorimetry in metformin- and placebo- 
treated mice treated with anti-GFRAL antibody to establish whether 
there are additional effects on energy expenditure. Data were analysed 
by analysis of covariance (ANCOVA) with body weight as the covariate. 
Metformin treatment resulted in a significant increase in metabolic 
rate, which was blocked by antagonism of GFRAL (Fig. 2e). Thus, under
conditions in which GDF15 levels are increased by metformin, body
AUC analysis by two-way ANOVA; effect of antibody, P = 0.031; effect of
metformin, P = 0.072; interaction of antibody and metformin, P = 0.91.
g, h, Insulin over time after oral GTT in mice treated as in e, f. Data are
mean ± s.e.m. i, Fasting insulin (time 0) after GTT in mice treated as in e, f. Data
are mean ± s.e.m., P values by two-way ANOVA; effect of antibody, P = 0.544;
interaction of genotype and metformin, P = 0.691. j, AUC analysis of insulin over
time in g, h. Data are mean ± s.e.m.; P values by two-way ANOVA; effect of
antibody, P = 0.197; interaction of genotype and metformin, P = 0.607.
k, l, Glucose over time after intraperitoneal GTT in mice on a high-fat diet
given a single dose of oral metformin (300 mg kg−1) 6 h before the GTT.
Data are mean ± s.e.m., n = 8 per group.


weight reduction is contributed to by both reduced food intake and 
an inappropriately high energy expenditure. 
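
The ANCOVA described above, with body weight as the covariate and treatment as a fixed factor, amounts to fitting a linear model with a common weight slope and a treatment offset. The Python sketch below illustrates only that model structure by ordinary least squares. The data are synthetic and hypothetical, the units are arbitrary, the function names are invented for the example, and no P values are computed, so it should not be read as the authors' analysis.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ancova(body_weight, treated, energy):
    """Fit energy = intercept + slope * body_weight + effect * treated.

    Returns (intercept, weight_slope, treatment_effect). The treatment effect
    is the difference in energy expenditure at equal body weight.
    """
    rows = [[1.0, float(w), float(t)] for w, t in zip(body_weight, treated)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, energy)) for i in range(k)]
    return tuple(solve_linear_system(xtx, xty))


# Hypothetical example: treated mice are lighter but expend more energy
# than their body weight alone would predict.
body_weight_g = [40, 42, 44, 46, 36, 38, 40, 42]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
energy = [0.2 + 0.01 * w + 0.05 * t for w, t in zip(body_weight_g, treated)]

intercept, weight_slope, treatment_effect = fit_ancova(body_weight_g, treated, energy)
print(f"treatment effect adjusted for body weight: {treatment_effect:.3f}")
```

Adjusting for body weight matters here because lighter animals are expected to expend less energy; comparing raw group means would understate an increase in metabolic rate in mice that have lost weight.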


GDF15 and glucose homeostasis 


To examine the extent to which the insulin-sensitizing effects of
metformin are dependent on GDF15, we repeated the experiment
described in Fig. 2a (see Extended Data Fig. 5), measuring insulin
tolerance in metformin- and vehicle-treated GDF15-null mice and their
wild-type littermates (Fig. 3a). Circulating metformin levels in both
genotypes were identical (Extended Data Fig. 5d) and consistent with
the high end of the human therapeutic range. Metformin significantly
increased insulin sensitivity, as assessed by the area under the plasma
glucose curve, with no significant effect of genotype (Fig. 3b). Similarly,
metformin reduced fasting blood glucose and fasting insulin in
a GDF15-independent manner (Fig. 3c, d).
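
The area under the plasma glucose curve is a single-number summary of a tolerance test. A common way to compute it is the trapezoidal rule, sketched below in Python. The text does not say which integration method was used, so the choice of method, the function name and the sample values are all assumptions made for illustration.

```python
def trapezoid_auc(times, values):
    """Area under a sampled curve by the trapezoidal rule.

    times must be strictly increasing; the result has units of
    (value unit) x (time unit).
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need matching sequences with at least two points")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        area += dt * (values[i] + values[i - 1]) / 2.0
    return area


# Hypothetical insulin-tolerance-test trace: minutes after injection
# and plasma glucose in mmol per litre.
sample_times_min = [0, 15, 30, 60, 90, 120]
sample_glucose = [8.0, 6.0, 5.0, 4.5, 5.5, 7.0]

auc = trapezoid_auc(sample_times_min, sample_glucose)
print(f"glucose AUC: {auc} mmol/l x min")
```

In an insulin-tolerance test a smaller area indicates a larger fall in glucose after insulin, which is why a lower AUC is read as greater insulin sensitivity.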

We also performed oral glucose-tolerance tests (GTTs) on metformin-
treated mice given either control IgG or anti-GFRAL antibody for
five weeks (Figs. 2d, 3e, f, Extended Data Fig. 6a). Although the
effect of metformin on glucose disposal in the oral GTT, as assessed by
the area under the plasma glucose curve, did not reach statistical
significance (two-way ANOVA, P = 0.072), there was a significant effect
of metformin on insulin (both fasting level and area under the curve
(AUC)) after glucose bolus that was independent of anti-GFRAL
antibody (Fig. 3g–j).

As these mice had different body weights at the time of assessment
(Fig. 2d, Extended Data Fig. 3c), we performed intraperitoneal GTTs in a
cohort of weight-matched Gdf15+/+ and Gdf15−/− mice that had been fed a
high-fat diet for two weeks before receiving a single dose of metformin
(300 mg kg−1) (Fig. 3k, l, Extended Data Fig. 6b–d). In these mice, there
was a significant effect of metformin on glucose levels (plasma glucose
AUC) that was independent of GDF15 (Extended Data Fig. 6e).

The effects of metformin in decreasing fasting glucose and insulin and
improving glucose tolerance do not require GDF15. Given the a priori
expected effect of weight loss on insulin sensitivity, it is noteworthy
that the effect of GDF15 status on insulin sensitivity as measured by
insulin-tolerance test (ITT) (Fig. 3b) fell just short of statistical
significance. In the follow-up of the Diabetes Prevention Program study in
non-diabetic individuals, weight loss after 5 years of metformin therapy
was approximately 6.5% of baseline weight. We therefore estimated
the effect of a 6.5% weight loss on improvements in fasting insulin over
5 years in the Ely Study, a prospective observational population-based
cohort study of men (n = 465) and women (n = 634) in the UK (mean age
52 years, mean body mass index 26 at baseline), showing that this
magnitude of weight loss was associated with a reduction in fasting plasma
insulin of −5.74 (−9.03, −2.45) pmol l−1 (mean ± 95% confidence interval
(CI)) in women and −8.78 (−16.24, −1.33) pmol l−1 in men. We conclude
that although there are GDF15-independent effects of metformin on
circulating levels of glucose and insulin, GDF15-dependent weight loss
probably contributes to enhancing insulin sensitivity.


Source of GDF15 production 


We examined Gdf15 gene expression in a tissue panel obtained from
mice fed a high-fat diet (for four weeks) and euthanized 6 h after a single
gavage dose of metformin (600 mg kg−1). Circulating concentrations
of GDF15 increased about 5.5-fold compared with vehicle-treated mice
(Extended Data Fig. 6f) and Gdf15 mRNA was significantly increased
by metformin in small intestine, colon and kidney (Fig. 4a). In situ
hybridization studies demonstrated strong Gdf15 expression in crypt
enterocytes in the colon and small intestine and in periglomerular renal
tubular cells (Fig. 4b, Extended Data Fig. 7a, b). We confirmed these sites
of tissue expression in mice fed a high-fat diet (those used in Fig. 2a) and
treated with metformin for 11 days (Extended Data Fig. 8). Further, in
organoids derived from human (Fig. 4c) and mouse (Fig. 4d) intestine,
grown in two-dimensional (2D) transwells and treated with metformin,
we observed a significant induction of Gdf15 mRNA expression and
GDF15 protein secretion.

Given the proposed importance of the liver for the metabolic action
of metformin, it was notable that the dominant GDF15 expression signal
was not from the liver (Fig. 4a, Extended Data Figs. 7a, 8). To determine
whether hepatocytes are capable of responding to biguanide drugs with
an increase in GDF15, we incubated freshly isolated mouse hepatocytes
(Extended Data Fig. 9a) and stem-cell derived human hepatocytes
(Extended Data Fig. 9b) with metformin and found a clear induction
of GDF15 expression. Additionally, acute administration of the more
cell-penetrant biguanide drug phenformin to mice increased circulating
GDF15 levels (Extended Data Fig. 9c) and markedly increased
Gdf15 mRNA expression in hepatocytes (Extended Data Fig. 9d, e). We
conclude that biguanides can induce GDF15 expression in many cell
types but, at least when given orally to mice, Gdf15 mRNA is mostly
induced in the distal small intestine, colon and kidney.

GDF15 expression has been reported to be a downstream target of
the cellular integrated stress response (ISR) pathway. Gdf15 mRNA
levels were increased in kidney and colon 24 h after a single oral dose of
metformin and these changes correlated positively with the fold elevation
of Chop (also known as Ddit3) mRNA (Extended Data Fig. 10a, b). As
phenformin has broader cell permeability than metformin, we used
it to explore the effects of biguanides on the ISR and its relationship
to GDF15 expression in cells. In mouse embryonic fibroblasts (MEFs),
which do not express the organic cation transporters needed for the
uptake of metformin, phenformin (but not metformin) increased
EIF2α phosphorylation, ATF4 and CHOP expression (Extended Data
Fig. 10c) and Gdf15 mRNA (Extended Data Fig. 10d), though the changes
in EIF2α phosphorylation and ATF4 and CHOP expression were modest
Fig. 4 | Metformin increases GDF15 expression in the enterocytes of distal
intestine and in renal tubular epithelial cells. a, Gdf15 mRNA expression
(normalized to expression levels of Actb) 6 h after a single dose of oral
metformin (600 mg kg−1) in tissues from wild-type mice on a high-fat diet. Data
are mean ± s.e.m., n = 7 per group; P value (95% CI) by two-tailed t-test. b, In situ
hybridization for Gdf15 mRNA (red spots). n = 7 per group. Representative
images from the mouse with circulating GDF15 level closest to group median,
treated with vehicle or metformin. Mice are from groups described in a.
c, GDF15 mRNA expression (left) and GDF15 protein in supernatant (right) of
human-derived 2D monolayer rectal organoids treated with metformin. Each
colour represents an independent experiment. Data are mean ± s.d., n = 4;
P values (95% CI) by two-tailed t-test. d, GDF15 protein in supernatants of
mouse-derived 2D monolayer duodenal (left) and ileal (right) organoids
treated with metformin. Each colour represents an independent experiment.
Data are mean ± s.d., duodenal, n = 5; ileal, n = 3; P values (95% CI) by two-tailed t-test.


compared with those induced by tunicamycin despite similar levels 
of Gdf15 mRNA induction. Both genetic deletion of Atf4 and small 
interfering (si)RNA-mediated knockdown of Chop significantly 
reduced phenformin-mediated induction of Gdf15 mRNA expression 
(Extended Data Fig. 10e, f). In addition, phenformin induction of GDF15
was markedly reduced by co-treatment with the EIF2α inhibitor ISRIB
but, notably, not by the PERK inhibitor GSK2606414 (Extended Data
Fig. 10g). Further, GDF15 secretion in response to metformin in mouse
duodenal organoids was also significantly reduced by co-treatment
with ISRIB (Extended Data Fig. 10h). However, gut organoids derived
from CHOP-null mice are still able to increase GDF15 secretion in
response to metformin (Extended Data Fig. 10i), indicating the existence
of CHOP-independent pathways under some circumstances. These
data suggest that the effects of biguanides on GDF15 expression are at
least partly dependent on the ISR pathway but are independent of PERK.
However, the relative importance of components of the ISR pathway 
may vary depending on specific cell type, dose and agent used. 

Our observations represent an advance in the understanding of the
action of metformin, one of the world’s most frequently prescribed
drugs. Metformin increases circulating GLP1 levels, but its metabolic
effects are unimpaired in mice lacking the GLP1 receptor.
Metformin alters the intestinal microbiome but it is challenging
to firmly establish a causal relationship between this effect and the
beneficial effects of the drug.

In this study, we present a body of data from humans, cells,
organoids and mice that securely establishes a major role for GDF15 in
the mediation of the beneficial effects of metformin on energy balance.
Whereas these effects probably contribute to the role of metformin
as an insulin sensitizer, it has other effects in decreasing glucose and
insulin in the absence of GDF15.

Whereas many mechanisms have been suggested for the glucoregulatory
actions of metformin, there has been less attention paid to
its effects on weight. Our discoveries relating to metformin’s effects
via GDF15 provide a compelling explanation for this important aspect
of its action.

It is notable that the lower small intestine and colon are a major site
of metformin-induced GDF15 expression. An emerging body of work
strongly implicates the intestine as a major site of metformin action.
Metformin increased glucose uptake into colonic epithelium from the
circulation and a gut-restricted formulation of metformin had greater
glucose-lowering efficacy than systemically absorbed formulations.
Our finding that the intestine is a major site of metformin-induced
GDF15 expression provides a further mechanism through which
metformin’s action on the intestinal epithelium may mediate some of its
benefits.
Methods 


Human studies 

We analysed samples from nine participants from a study with a 
placebo-controlled, double-blind crossover design (previously 
described in ref. ”). In brief, placebo or metformin (week 1, 500 mg 
twice daily; week 2, 1,000 mg twice daily) was administered following 
a six-week period of washout. Samples were collected in the morning 
after overnight fasting. The study was approved by the Mayo Clinic Insti- 
tutional Review Board and all participants provided written, informed 
consent (NCT01956929). 

CAMERA was a randomized, double-blinded, placebo-controlled trial 
designed to investigate the effect of metformin on surrogate markers 
of cardiovascular disease in patients without diabetes, aged 35 to 75, 
with established coronary heart disease and a large waist circumference 
(>94 cm in men, >80 cm in women) (NCT00723307). This single-centre 
trial enrolled 173 adults who were followed up for 18 months each. 
A detailed description of the trial and its results has been published 
previously”. In brief, participants were randomized 1:1 to 850 mg 
metformin or matched placebo twice daily with meals. Participants 
attended six monthly visits after overnight fasts and before taking 
their morning dose of metformin. Blood samples collected during 
the trial were centrifuged at 4 °C soon after sampling, separated and 
stored at -80 °C. 

All participants provided written informed consent. The study was 
approved by the Medicines and Healthcare Products Regulatory Agency 
and West Glasgow Research Ethics Committee, and done in accordance 
with the principles of the Declaration of Helsinki and good clinical 
practice guidelines. 

Serum GDF15 was measured in a fashion blinded to treatment 
allocation or timing of samples by the Cambridge Biochemical Assay 
Laboratory, University of Cambridge. Measurements were performed 
with antibodies and standards from R&D Systems (R&D Systems) using 
a microtitre-plate-based two-site electrochemiluminescence immuno- 
assay using the MesoScale Discovery assay platform (MSD). 


Mouse studies 
Studies were carried out at two sites: NGM Biopharmaceuticals, 
California, and the University of Cambridge. 

At NGM, all experiments were conducted with NGM IACUC-approved 
protocols and all relevant ethical regulations were complied with 
throughout the course of the studies, including efforts to reduce the 
number of animals used. Experimental animals were kept under con- 
trolled light (12 h:12 h light:dark cycle, dark 18:30-06:30), temperature 
(22 ± 3 °C) and humidity (50 ± 20%) conditions. They were fed ad libitum 
on 2018 Teklad Global 18% Protein Rodent Diet containing 24 kcal% fat, 
18 kcal% protein and 58 kcal% carbohydrate, or on high-fat rodent diet 
containing 60 kcal% fat, 20 kcal% protein and 20 kcal% carbohydrates 
from Research Diets D12492i, hereafter referred to as 60% HFD. 

In Cambridge, all mouse studies were performed in accordance with 
UK Home Office Legislation regulated under the Animals (Scientific 
Procedures) Act 1986 Amendment, Regulations 2012, following ethi- 
cal review by the University of Cambridge Animal Welfare and Ethical 
Review Body (AWERB). They were maintained in a 12 h:12 h light:dark 
cycle (lights on 07:00-19:00), temperature-controlled (22 °C) facil- 
ity, with ad libitum access to food (RM3(E) Expanded Chow (Special 
Diets Services)) and water. Any mice bought from an outside supplier 
were acclimatised in a holding room for at least one week before study. 
During study periods they were fed ad libitum high-fat diet, either 
D12451i (45 kcal% fat, 20 kcal% protein and 35 kcal% carbohydrates, 
herein referred to as 45% HFD) or D12492i (Research Diets) as high- 
lighted in the individual study. 

Sample sizes were determined on the basis of homogeneity and 
consistency of characteristics in the selected models and were sufficient 
to detect statistically significant differences in body weight, 
food intake and serum parameters between groups. Experiments were 
performed with animals of a single gender in each study. Animals were 
randomized into the treatment groups on the basis of body weight 
such that the mean body weights of each group were as close to each 
other as possible, but without using an excess number of animals. No 
samples or animals were excluded from analyses. Researchers were 
not blinded to group allocations. 
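
The text does not say how the weight-balanced allocation was computed. As a minimal sketch of one way such an allocation could be done, the following tries many random splits into equal-sized groups and keeps the one whose group mean body weights are closest. The function name, the number of trials and the example weights are all hypothetical and are not taken from the study.

```python
import random
from statistics import mean


def allocate_by_weight(weights, n_groups, n_trials=2000, seed=0):
    """Return (groups, spread): groups is a list of lists of animal indices,
    spread is the difference between the largest and smallest group mean.

    Illustrative assumption only: the study's actual procedure is not described.
    """
    rng = random.Random(seed)
    ids = list(range(len(weights)))
    best_groups, best_spread = None, float("inf")
    for _ in range(n_trials):
        rng.shuffle(ids)
        # Deal the shuffled animals round-robin into the groups.
        groups = [ids[g::n_groups] for g in range(n_groups)]
        means = [mean(weights[i] for i in grp) for grp in groups]
        spread = max(means) - min(means)
        if spread < best_spread:
            best_groups = [sorted(grp) for grp in groups]
            best_spread = spread
    return best_groups, best_spread


# Hypothetical body weights in grams for 12 animals.
example_weights = [38.2, 39.1, 37.5, 40.2, 36.8, 38.9,
                   37.7, 39.5, 38.0, 36.5, 40.0, 38.4]
groups, spread = allocate_by_weight(example_weights, n_groups=2)
```

Fixing the seed makes the allocation reproducible; every animal is used exactly once, so no extra animals are needed to achieve balance.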


Mouse study 1, acute two-dose metformin and high-fat diet 

Male C57BL6/J mice fed 60% HFD for 17 weeks were studied aged 
23 weeks (body weight, mean ± s.e.m., 45.6 ± 0.8 g). Metformin (Sigma- 
Aldrich no. 1396309) was reconstituted in water at 30 mg ml⁻¹ for oral 
gavage and given in the early part of the light cycle. Terminal blood 
was collected by cardiac puncture into EDTA-coated tubes. GDF15 
levels were measured using Mouse/Rat GDF15 Quantikine ELISA Kit 
(no. MGD-150, R&D Systems) according to the manufacturers’ instruc- 
tions. RNA was isolated from tissues using the Qiagen RNeasy Kit. RNA 
was quantified and 500 ng was used for cDNA synthesis (SuperScript 
VILO; 11754050, ThermoFisher) followed by quantitative (q)PCR. All 
Taqman probes were purchased from Applied Biosystems. All genes 
are expressed relative to 18S control probe and were run in triplicate. 


Mouse study 2, acute metformin and normal diet 

Ad libitum group. Male C57BL6/J mice (Charles River) were studied 
at 11 weeks old. Five-hundred milligrams of metformin was dissolved 
in 20 ml water to make a working stock of 25 mg ml⁻¹. One hour after 
onset of light cycle, mice received a single dose by oral gavage of either 
metformin at 300 mg kg⁻¹ (Sigma, PHR1084-500MG) or a matched vol- 
ume of vehicle (water). Weight (mean ± s.e.m.) of control and treatment 
groups were 27.2 ± 0.3 g and 26.7 ± 0.2 g, respectively, on the day of 
study. After gavage, mice were returned to an individual cage and were 
euthanized at the relevant time point by terminal anaesthesia (Euthatal 
by intraperitoneal injection). Blood was collected into a Sarstedt Serum 
Gel 1.1 ml Micro Tube, left for 30 min at room temperature, then spun 
for 5 min at 10,000g at 4 °C before being frozen and stored at -80 °C 
until assayed. Mouse GDF15 levels were measured using a Mouse GDF15 
DuoSet ELISA (R&D Systems) which had been modified to run as an 
electrochemiluminescence assay on the Meso Scale Discovery assay 
platform. 
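
The stock and dose figures above can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it uses the treatment-group mean body weight (26.7 g) as a stand-in, whereas actual gavage volumes would be calculated from each animal's own weight.

```python
def stock_concentration_mg_per_ml(mass_mg: float, volume_ml: float) -> float:
    """Concentration of a stock made by dissolving mass_mg in volume_ml."""
    return mass_mg / volume_ml


def gavage_volume_ml(body_weight_g: float, dose_mg_per_kg: float,
                     stock_mg_per_ml: float) -> float:
    """Volume of stock that delivers dose_mg_per_kg to one animal."""
    dose_mg = dose_mg_per_kg * body_weight_g / 1000.0  # g -> kg
    return dose_mg / stock_mg_per_ml


# 500 mg metformin in 20 ml water, as stated in the text.
stock = stock_concentration_mg_per_ml(500.0, 20.0)

# 300 mg per kg for a mouse at the group mean weight of 26.7 g (illustrative).
volume = gavage_volume_ml(26.7, 300.0, stock)
```

Here `stock` evaluates to 25 mg per ml, matching the stated working stock, and `volume` is about 0.32 ml, which corresponds to a dosing volume of 12 ml per kg.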


Fasted group. Mice, conditions and methods as in the ad libitum group, 
except that male mice were studied at 9 weeks old and 12 h before 
administration of metformin; mice and bedding were transferred to 
new cages with no food in the hopper. Weight (mean ± s.e.m.) after 
fasting and on day of gavage were 22.3 ± 0.5 g and 23.2 ± 0.7 g for control 
and treatment groups, respectively. 


Mouse study 3, metformin and high-fat diet, Gdf15−/− and wild- 
type mice 

C57BL/6N-Gdf15tm1a(KOMP)Wtsi/H mice (referred to as Gdf15−/− mice) 
were obtained from the MRC Harwell Institute, which distributes these 
mice on behalf of the European Mouse Mutant Archive (https://www.infrafrontier.eu/). 
The MRC Harwell Institute is also a member of 
the International Mouse Phenotyping Consortium (IMPC) and has 
received funding from the MRC for generating and/or phenotyping 
the C57BL/6N-Gdf15tm1a(KOMP)Wtsi/H mice. The research reported 
in this publication is solely the responsibility of the authors and does 
not necessarily represent the official views of the Medical Research 
Council. Associated primary phenotypic information may be found 
at https://www.mousephenotype.org/. Details of the alleles have been 
published*° 2. 

Experimental cohorts of male Gdf15−/− and wild-type mice were gen- 
erated by het × het breeding pairs. Mice were aged between 4.5 and 
6.5 months. One week before study start, mice were single-housed and 
three days before the first dose of metformin treatment, mice were 
transferred from standard chow to 60% high-fat diet. On the day of first 
gavage, body weight of study groups (mean ± s.e.m.) were 38.2 ± 1.0 g 
vs 38.8 ± 0.6 g for wild-type vehicle and metformin treatments, respec- 
tively, and 37.9 ± 0.8 g vs 37.0 ± 1.4 g for Gdf15−/− vehicle and metformin 
treatments, respectively. Each mouse received a daily gavage of either 
vehicle or metformin for 11 days, and their body weight and food intake 
was measured daily in the early part of the light cycle. One data point 
of 25 food intake points collected on day 11 of the study was lost owing 
to technical error (Gdf15+/+ mouse, metformin). On day 11, mice were 
euthanized by terminal anaesthesia 4 h after gavage, and blood was 
obtained as in mouse study 2. Tissues were fresh frozen on dry ice and 
kept at -80 °C until the day of RNA extraction. 


Mouse study 4, metformin and high-fat diet, Gfral−/− mice 

Gfral−/− mice on a mixed 129/SvEv-C57BL/6 background were purchased 
from Taconic (TF3754) and backcrossed for 10 generations to >99% 
C57BL/6 background at the NGM animal facility. Experimental cohorts 
were generated by het x het breeding pairs. Study design was as in study 
3, except that terminal blood was collected into EDTA-coated tubes. 


Mouse study 5, anti-GFRAL antibody, metformin and high-fat 
diet 

Anti-GFRAL antibody generation. Anti-GFRAL monoclonal antibodies 
were generated by immunizing C57Bl/6 mice with recombinant puri- 
fied GFRAL ECD-hFc fusion protein, which was purified via sequential 
protein-A affinity and size exclusion chromatography techniques using 
MabSelect SuRe and Superdex 200 purification media, respectively 
(GE Healthcare), as described in patent US10174119B2 (https://patents. 
google.com/patent/US10174119B2/en). An in-house pTT5 hIgK hIgG1 
expression vector was engineered to include the DEVDG (caspase-3) 
proteolytic site N-terminal to the Fc domain. The heavy chains of anti- 
GFRAL monoclonal antibodies were subcloned via EcoR1 and HindIII 
sites of an in-house engineered pTT5 hIgK hIgG1 caspase-cleavable 
vector. Light chains of anti-GFRAL monoclonal antibodies were also 
subcloned in the EcoR1 and HindIII sites in the pTT5 hIgK hKappa vec- 
tor. The antibodies were transiently expressed in Expi293 cells (Thermo 
Fisher Scientific) transfected with the pTT5 expression vector, and 
purified from conditioned media by sequential protein-A affinity and 
size-exclusion chromatography using MabSelect SuRe and Superdex 
200 purification media, respectively (GE Healthcare). All purified an- 
tibody material was verified endotoxin-free and formulated in PBS for 
in vitro and in vivo studies. Characterization of anti-GFRAL functional 
blocking antibodies was carried out using cell-based RET/GFRAL lucif- 
erase gene reporter assays, in vitro binding studies (ELISA and Biacore) 
and in vivo studies, as described in patent number US10174119B2 
(https://patents.google.com/patent/US10174119B2/en). 

In all studies with anti-GFRAL, purified recombinant non-target- 
ing IgG on the same antibody framework was used as control. Met- 
formin was mixed with food paste made from the 60 kcal% fat diet 
(Research Diet no. D12492) using a food blender at a concentration 
to achieve an approximate consumption of 300 mg kg⁻¹ metformin 
per day per mouse. Male mice were single-housed throughout and at 
the start of study period, body weight (mean ± s.e.m.) was 43.7 ± 1.4 g 
(vehicle + control IgG), 42.3 ± 1.4 g (vehicle + anti-GFRAL), 41.9 ± 1.1 g 
(metformin + control IgG) and 43.3 ± 1.3 g (metformin + anti-GFRAL). 
Recombinant antibodies were administered by subcutaneous injec- 
tion in the early part of the light cycle. Body composition (lean and 
fat mass) was analysed by ECHO MRI M113 mouse system (Echo Medi- 
cal Systems). The metabolic parameters oxygen consumption (VO₂) 
and carbon dioxide production (VCO₂) were measured by an indirect 
calorimetry system (LabMaster TSE System) in open-circuit sealed 
chambers. Measurements were performed for the dark (from 18:00 to 
06:00) or light (from 06:00 to 18:00) period under ad libitum feeding 
conditions. Mice were placed in individual metabolic cages and allowed 
to acclimate for a period of 24 h before data collection every 30 min. 


Finally, mice underwent a GTT. Mice were fasted for 6 h (07:00-13:00) 
in a clean cage. Blood samples (~30 µl) were collected as baseline before 
oral GTT. Mice were orally gavaged with 1 g kg⁻¹ of 20% glucose solution 
with a dosing volume of 5 ml kg⁻¹. Blood samples were then collected 
through a tail nick into K₂-EDTA-coated tubes (SARSTEDT Microvette; 
no. 20.1278.100) at 15, 30, 60 and 120 min after glucose challenge. Blood 
samples were centrifuged at 4 °C and the separated plasma was stored 
at −20 °C until used for plasma glucose and insulin assays. Glucose assay 
reagents were obtained from Wako (no. 439-90901) and the insulin 
ELISA kit was obtained from ALPCO (no. 80-INSMSU-E01). 


Mouse study 6, ITT after metformin and high-fat diet, Gdf15−/− 
and wild-type mice 

Mice generation and protocol are the same as in study 3, except mice 
were aged 4 to 6 months. On the day of the first gavage, body weights 
(mean ± s.e.m.) of study groups were 35.1 ± 1.2 g (wild-type, vehicle), 
35.05 ± 1.2 g (wild-type, metformin), 35.08 ± 1.02 g (Gdf15−/−, vehicle) 
and 35.02 ± 1.47 g (Gdf15−/−, metformin). On day 11, after the final dose 
of metformin, mice were fasted for 4 h. Baseline venous blood sample 
was collected into heparinised capillary tube for insulin measurement 
and blood glucose was measured using approximately 2 µl blood drops 
using a glucometer (AlphaTrak2; Abbott Laboratories) and glucose 
strips (AlphaTrak2 test 2 strips, Abbott Laboratories, Zoetis). Mice 
were given intraperitoneal injection of insulin (0.5 U kg⁻¹, Actrapid, 
NovoNordisk) and serial mouse glucose levels were measured at time 
points indicated. Mice were killed by terminal anaesthesia as in study 
2. Mouse insulin was measured using a Meso Scale Discovery two-plex 
mouse metabolic immunoassay kit according to the manufacturer’s 
instructions and using calibrators provided by Meso Scale Diagnostics. 
Serum metformin levels were quantified using a stable isotope dilution 
liquid chromatography-mass spectrometry (LC-MS/MS) method 
described previously”. 


Mouse study 7, GTT after metformin and high-fat diet, Gdf15−/− 
and wild-type mice 

Mice generation was as in study 3, except using female mice aged 3.5 to 
5.5 months. Two groups of mice (Gdf15+/+ and Gdf15−/− littermates, body 
weight (mean ± s.e.m.) 24.1 ± 1.4 g and 24.3 ± 1.3 g, respectively) were 
fed 60% HFD for two weeks. Each genotype was then further split into 
vehicle or metformin (300 mg kg⁻¹) treatment groups, given a single 
gavage dose at 08:00 and fasted for 6 h. At time of GTT, body weights 
(mean ± s.e.m.) of study groups were 26.4 ± 1.5 g (wild-type, vehicle), 
26.5 ± 1.0 g (wild-type, metformin), 25.6 ± 1.2 g (Gdf15−/−, vehicle) and 
27.1 ± 1.3 g (Gdf15−/−, metformin); one-way ANOVA, P = 0.8722. Baseline 
testing was as in mouse study 6. Mice then received a single dose of 
20% glucose intraperitoneally (2 mg g⁻¹) with serial measurement of 
glucose levels at time points indicated. Euthanasia and insulin analysis 
were performed as in mouse study 6. 


Mouse study 8, acute single high-dose metformin and high-fat 
diet 

Male C57BL6/J mice (Charles River) aged 14 weeks were switched 
from standard chow to 45% HFD (D12451i) for 1 week then 60% HFD 
(D12492i) for 3 weeks. At the time of the study (18 weeks old) body 
weights (mean ± s.e.m.) were 40.4 ± 1.2 g and 41.1 ± 1.3 g for the vehi- 
cle and metformin groups, respectively. Five-hundred milligrams of 
metformin (Sigma, PHR1084-500MG) were dissolved in 8.35 ml water 
to make a working stock of 60 mg ml⁻¹. Mice received a single dose by 
oral gavage of either 600 mg kg⁻¹ metformin or a matched volume of 
vehicle (water). They were returned to ad libitum 60% HFD and 6 h later 
blood was collected as in study 2. Tissue samples for RNA analysis were 
collected into Lysing Matrix D homogenization tubes (MP Biomedicals) 
on dry ice and stored at −80 °C until they were processed. Intestine 
between pylorus of stomach and caecum was laid out into three equal 
parts, with tissue taken from the midpoint of each third labelled as 
‘proximal’, ‘middle’ and ‘distal’ (adapted from ref. **). The colon section 
was from the midpoint between caecum and anus. Tissues for in situ 
hybridization were dissected and placed into 10% formalin/PBS for 
24 h at room temperature, transferred to 70% ethanol and processed 
into paraffin. Five-micrometre sections were cut and mounted onto 
Superfrost Plus (Thermo Fisher Scientific). Detection of mouse Gdf15 
was performed on formalin-fixed paraffin-embedded sections using 
Advanced Cell Diagnostics (ACD) RNAscope 2.5 LS Reagent Kit-RED (no. 
322150) and RNAscope LS 2.5 Probe Mm-Gdf15-O1 (no. 442948) (ACD). 
In brief, sections were baked for 1 h at 60 °C before loading onto a Bond 
RX instrument (Leica Biosystems). Slides were deparaffinized and rehy- 
drated on board before pre-treatments using Epitope Retrieval Solution 
2 (no. AR9640, Leica Biosystems) at 95 °C for 15 min, and ACD Enzyme 
from the LS Reagent kit at 40 °C for 15 min. Probe hybridization and 
signal amplification was performed according to the manufacturer's 
instructions. Fast red detection of mouse Gdf15 was performed on 
the Bond RX using the Bond Polymer Refine Red Detection Kit (Leica 
Biosystems, no. DS9390) according to the ACD protocol. Slides were 
then counterstained with haematoxylin, removed from the Bond RX 
and were heated at 60 °C for 1 h, dipped in xylene and mounted using 
EcoMount Mounting Medium (Biocare Medical, no. EM897L). 

Slides were imaged on an automated slide-scanning microscope (Axioscan 
Z1 and Hamamatsu orca flash 4.0 V3 camera) using a 20x objective with 
a numerical aperture of 0.8. Hybridization specificity was confirmed 
by the absence of staining in Gdf15−/− mice. 

RNA extraction was carried out with approximately 100 mg of tis- 
sue in 1 ml Qiazol Lysis Reagent (Qiagen 79306) using Lysing Matrix 
D homogenization tube and Fastprep 24 Homogenizer (MP Biomedi- 
cals) and Qiagen RNeasy Mini Kit (no. 74106) with DNase I treatment 
following manufacturers’ protocols. Five-hundred nanograms of 
RNA was used to generate cDNA using Promega M-MLV reverse tran- 
scriptase followed by TaqMan qPCR in triplicate for GDF15. Samples 
were normalized to Actb. TaqMan Probes: Mm00442228_m1 GDF15, 
Mm02619580_g1 Actb; TaqMan 2x universal PCR Master mix (Applied 
Biosystems Thermo Fisher, 4318157); QuantStudio 7 Flex Real time PCR 
system (Applied Biosystems Life Technologies). 


Mouse study 9, acute phenformin and normal diet, wild-type 
mice 

Male C57BL6/J mice aged 14 weeks with supplier, protocol and 
methods as in study 2, except that phenformin (Sigma PHR1573, 500 mg) 
was used instead of metformin. 


Organoid studies 

Duodenal and ileal mouse organoid line generation, maintenance 
and 2D culture was performed as previously described*. CHOP-null 
mice were a gift from J. Goodall, with a line from Jackson Laboratory 
(B6.129S(Cg)-Ddit3tm2.1Dron/J, stock no. 005530). Human rectal orga- 
noids (experiments approved by the Research Ethics Committee under 
licence number 09/H0308/24) were generated from fresh surgical 
specimens (Tissue Bank, Addenbrooke’s Hospital (Cambridge, UK)) 
following a modified protocol*>**. In brief, rectal tissue was chopped 
into 5-mm fragments and incubated in 30 mM EDTA for 3 x 10 min, with 
tissue shaken in PBS after each EDTA treatment to release intestinal 
crypts. The isolated crypts were then further digested using TrypLE 
(Life Technologies) for 5 min at 37 °C to generate small cell clusters. 
These were then seeded into basement membrane extract (BME, 
R&D Technology), with 20-µl domes polymerized in multiwell (48) 
dishes for 30-60 min at 37 °C. Organoid medium” was then overlaid 
and changed three times per week. Human organoids were passaged 
every 14-21 days using TrypLE digestion for 15 min at 37 °C, followed by 
mechanical shearing with rigorous pipetting to break organoids up into 
small clusters, which were then seeded as before in BME. For Transwell 
experiments, TrypLE-digested organoids were seeded onto Matrigel 
(Corning)-coated (2% for 60 min at 37 °C) polyethylene terephthalate 
cell culture inserts, pore size 0.3 µm (Falcon) in organoid medium 
supplemented with Y-27632 (R&D Technology). Organoids were 
observed through the transparent cell inserts to ensure 2D culture 
formation (allowing apical cell access for drug treatments). Medium 
was changed after 2 days and then switched on day 3 to a differentia- 
tion medium with Wnt3A-conditioned medium reduced to 10% and 
SB202190 or nicotinamide omitted from culture for 5 days. 

For GDF15 secretion experiments, 2D cultured organoid cells were 
treated for 24 h with indicated drugs, with medium then collected 
and GDF15 measured at the Core Biochemical Assay Laboratory (Cam- 
bridge) using the human or mouse GDF15 assay kit as outlined in the 
CAMERA human study and mouse study 2 above. 

RNA was extracted using TRI reagent (Sigma), with any contami- 
nated DNA eliminated using DNA-free removal kit (Invitrogen). Purified 
RNA was then reverse-transcribed using SuperScript II (Invitrogen) 
as per the manufacturer’s protocol. Quantitative PCR with reverse 
transcription was performed on a QuantStudio 7 (Applied Biosystems) 
using Fast Taqman mastermix and the following probes (Applied Biosys- 
tems): human GDF15 (Hs00171132_m1) and human ACTB (Hs01060665_g1). 
Gene expression was measured relative to β-actin in the same 
sample using the ΔCt method, with fold (relative to control) shown 
for each experiment. 
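
The relative-expression calculation described above (target gene normalized to a reference gene in the same sample, then expressed as fold relative to control) can be sketched as follows. This is an illustration under the common assumption of perfect doubling per PCR cycle; the function names and the threshold-cycle values are hypothetical and are not data from the study.

```python
def delta_ct(ct_target: float, ct_reference: float) -> float:
    """Threshold cycle of the target gene minus that of the reference gene."""
    return ct_target - ct_reference


def fold_vs_control(dct_sample: float, dct_control: float) -> float:
    """Fold change relative to control, assuming product doubles each cycle."""
    return 2.0 ** -(dct_sample - dct_control)


# Hypothetical threshold cycles (GDF15 as target, ACTB as reference).
dct_control = delta_ct(ct_target=28.0, ct_reference=18.0)  # 10 cycles
dct_treated = delta_ct(ct_target=26.0, ct_reference=18.0)  # 8 cycles

fold = fold_vs_control(dct_treated, dct_control)
```

In this made-up example the treated sample reaches threshold two cycles earlier than control after normalization, which corresponds to a four-fold increase in expression.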


Hepatocyte studies 

Primary mouse hepatocyte isolation and culture. Hepatocytes 
from 8 to 12-week-old C57B6J male mice were isolated by retrograde, 
non-recirculating in situ collagenase liver perfusion. In brief, livers 
were perfused with modified Hanks medium without calcium (8.0 g l⁻¹ 
NaCl, 0.4 g l⁻¹ KCl, 0.2 g l⁻¹ MgSO₄·7H₂O, 0.12 g l⁻¹ Na₂HPO₄·2H₂O, 0.12 
g l⁻¹ KH₂PO₄, 3 g l⁻¹ Hepes, 0.342 g l⁻¹ EGTA and 0.05 g l⁻¹) followed by 
digestion with perfusion medium supplemented with calcium (0.585 
g l⁻¹ CaCl₂·2H₂O) and 0.5 mg ml⁻¹ collagenase IV (Sigma, C5138). The 
digested liver was removed and washed using chilled DMEM:F12 (Sigma) 
medium containing 2 mM L-glutamine, 10% FBS, 1% penicillin/strepto- 
mycin (Invitrogen). Viable cells were collected by Percoll (Sigma) gra- 
dient. The final pellet was resuspended in the same DMEM:F12 media. 
Cell viability was greater than 90%. Hepatocytes were plated onto 
Primaria plates (Corning). Hepatocytes were allowed to recover and 
attach for 4-6 h before replacement of the medium overnight before 
stress treatments the following day for the times and concentrations 
indicated. 


Generation and culture of iPSC-derived human hepatocytes. The 
human induced pluripotent (iPS) cell line A1ATDR/R used in this work 
was derived as previously described”’** under approval by the regional 
research ethics committee (reference number 08/H0311/201). iPS cells 
were maintained in Essential 8 chemically defined media” supplement- 
ed with 2 ng ml⁻¹ TGF-β (R&D) and 25 ng ml⁻¹ FGF2 (R&D), and cultured on 
plates coated with 10 µg ml⁻¹ Vitronectin XF™ (STEMCELL Technolo- 
gies). Colonies were regularly passaged by short-term incubation with 
0.5 mM EDTA in PBS. For hepatocyte differentiation, colonies were dis- 
sociated into single cells following incubation with StemPro Accutase 
Cell Dissociation Reagent (Gibco) for 5 min at 37 °C. Single cell suspen- 
sions were seeded on plates coated with 10 µg ml⁻¹ Vitronectin XF™ 
(STEMCELL Technologies) in maintenance media supplemented with 
10 µM ROCK Inhibitor Y-27632 (Selleckchem) and grown for up to 72 h 
before differentiation. Hepatocytes were differentiated as previously 
reported*®, with minor modifications as listed. In brief, following endo- 
derm differentiation, anterior foregut specification was achieved after 
5 days of culture with RPMI-B27 differentiation media supplemented 
with 50 ng ml⁻¹ Activin A (R&D)*°. Foregut cells were further differenti- 
ated into hepatocytes with HepatoZYME-SFM (Gibco) supplemented 
with 2 mM L-glutamine (Gibco), 1% penicillin-streptomycin (Gibco), 
2% non-essential amino acids (Gibco), 2% chemically defined lipids 
(Gibco), 14 µg ml⁻¹ of insulin (Roche), 30 µg ml⁻¹ of transferrin (Roche), 
50 ng ml⁻¹ hepatocyte growth factor (R&D), and 20 ng ml⁻¹ oncostatin 
M (R&D), for up to 27 days. 


Cellular studies on ISR 


Chemicals and reagents. Tunicamycin and ISRIB were purchased 
from Sigma-Aldrich. Metformin and phenformin were purchased from 
Cayman Chemicals and GSK2606414 from Calbiochem. The antibodies 
for GDF15 and CHOP (sc-7351) were obtained from Santa Cruz. 
Phospho-S51 eIF2α (ab32157) and Calnexin (ab75801) were purchased from 
Abcam. D. Ron provided the antibody for ATF4. 

Eukaryotic cell lines and treatments. Mouse embryonic fibroblast 
(MEF) cell lines were obtained from D. Ron (CIMR/IMS, Cambridge) and 
maintained as previously described'®. MEFs were transfected with 30 
nM control siRNA or a SMARTpool ON-TARGETplus siRNA for mouse Chop 
(Dharmacon L-062068-00-0005) using Lipofectamine RNAiMAX (Invitrogen) 
according to the manufacturer’s instructions. 48 h post siRNA 
transfection, cells were processed for RNA and protein expression 
analysis. All cells were maintained at 37 °C in a humidified atmosphere 
of 5% CO₂ and seeded onto 6- or 12-well plates before stress treat- 
ments for the times and concentrations indicated. Vehicle treatments 
(for example, DMSO) were used for control cells when appropriate. 


RNA isolation, cDNA synthesis and qPCR. Following treatments, cells 
were lysed with Buffer RLT (Qiagen) containing 1% 2-mercaptoethanol 
and processed through a Qiashredder with total RNA extracted us- 
ing the RNeasy isolation kit according to manufacturer’s instructions 
(Qiagen). RNA concentration and quality was determined by Nanodrop. 
400 ng-500 ng of total RNA was treated with DNase I (Thermo Fisher Sci- 
entific) and then converted to cDNA using MMLV Reverse Transcriptase 
with random primers (Promega). Quantitative RT-PCR was carried out 
with either TaqMan Universal PCR Master Mix or SYBR Green PCR mas- 
ter mix on the QuantStudio 7 Flex Real time PCR system (Applied Biosys- 
tems). All reactions were carried out in either duplicate or triplicate and 
Ct values were obtained. Relative differences in gene expression were 
normalized to the expression levels of the housekeeping genes HPRT 
or GAPDH for cell analysis, using the standard curve method. Primers 
used for this study: mouse Gdf15 (Mm00442228_m1, ThermoFisher 
Scientific), human GDF15 (Hs00171132_m1, ThermoFisher Scientific), 
human GAPDH (Hs02758991_g1, ThermoFisher Scientific), mouse 
Hprt (forward AGCCTAAGATGAGCGCAAGT, reverse 
GGCCACAGGACTAGAACACC). 


Immunoblotting. Following treatments, cells were washed twice with 
ice-cold D-PBS and proteins collected using RIPA buffer supplemented 
with cOmplete protease and PhosStop inhibitors (Sigma). The lysates 
were cleared by centrifugation at 13,000 rpm for 15 min at 4 °C, and 
protein concentration determined by a Bio-Rad DC protein assay. Typi- 
cally, 20-30 µg of protein lysates were denatured in NuPAGE 4 x LDS 
sample buffer and resolved on NuPage 4-12% Bis-Tris gels (Invitrogen) 
and the proteins were transferred by iBlot (Invitrogen) onto nitrocel- 
lulose membranes. The membranes were blocked with 5% non-fat dry 
milk or 5% BSA (Sigma) for 1h at room temperature and incubated 
with the antibodies described in the reagents section. Following a 16 h 
incubation at 4 °C, all membranes were washed five times in Tris- 
buffered saline/0.1% Tween-20 before incubation with horseradish 
peroxidase (HRP)-conjugated anti-rabbit immunoglobulin G (IgG) or 
HRP-conjugated anti-mouse IgG (Cell Signalling Technologies). The 
bands were visualized using Immobilon Western Chemiluminescent 
HRP Substrate (Millipore). All images were acquired on the ImageQuant 
LAS 4000 (GE Healthcare). 


Statistical analyses 

CAMERA data were analysed using a mixed linear model with restricted 
maximum likelihood to investigate the metformin effect on GDF15. 
This is analogous to conducting a repeated measures ANOVA, but is 
a more flexible analysis and allows for missing observations within 
subjects. The 0-18 months difference in weight and GDF15 correlation 
was tested using Spearman’s coefficient. CAMERA data were analysed 
using STATA v.15.1. 
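
For readers unfamiliar with the rank correlation mentioned above, the following is a minimal sketch of Spearman's coefficient, computed as the Pearson correlation of average ranks. It is not the STATA routine used in the study; the function names and the example values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation of the ranked data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical paired changes: a perfectly inverse monotonic relationship.
rho = spearman_rho([1.0, 2.0, 3.0, 4.0], [10.0, 8.0, 5.0, 1.0])
```

Because the coefficient depends only on ranks, it is insensitive to the skewed distributions that circulating biomarker changes often show.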

Other statistical analyses were performed using Prism 7 and Prism 
8, using unpaired two-tailed t-tests, or two-way ANOVA, with multiple 
comparison adjustment by Tukey’s or Sidak’s test. Metabolic rate was 
determined using ANCOVA with energy expenditure as the dependent 
variable, body weight as a covariate and treatment as a fixed factor. 
ANCOVA and analyses of glucose and ITT in mice were performed using 
SPSS 25 (IBM). 


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Source Data for Figs. 1-4 and Extended Data Figs. 1-6, 8-10 are provided 
with the paper. Other data that support the findings of this study are 
available from the corresponding authors upon request. The CAMERA 
trial dataset is held at the University of Glasgow and is available on 
request from the investigators subject to a signed agreement operating 
within the confines of the original ethics application. 
Extended Data Fig. 1 | Expanded CAMERA dataset. a, Linear association 
between change in body weight and change in plasma GDF15 between 0 and 18 
months among metformin-treated participants (n = 74, Spearman correlation 
r = −0.26, two-sided P = 0.024). The red line is the linear regression slope, and 
grey area is 95% CI for the slope. b, Absolute and relative differences in plasma 
GDF15 concentration between metformin and placebo groups at each time 
Extended Data Fig. 2 | Effect of single oral dose of metformin in chow-fed 
mice. Serum GDF15 levels in male mice measured 2, 4 or 8 h after a single gavage 
dose of metformin (300 mg kg⁻¹). a, Mice fed ad libitum overnight before 
gavage. b, Mice fasted for 12 h before gavage. Data are mean ± s.e.m. (a, n = 6 per 
group; b, n = 4 per group); P values by two-way ANOVA with Tukey’s correction 
for multiple comparisons. 
Extended Data Fig. 3 | Body weight changes with metformin treatment in 
mice with disrupted GDF15-GFRAL signalling. a, Absolute body weight in 
Gdf15+/+ and Gdf15−/− mice on a high-fat diet treated with metformin 
(300 mg kg⁻¹ day⁻¹) for 11 days, mice as in Fig. 2a. Data are mean ± s.e.m., 
P values by two-way ANOVA with Tukey’s correction for multiple comparisons. 
mice dosed with an anti-GFRAL antagonist antibody or with control IgG weekly 
correction for multiple comparisons. 
Extended Data Fig. 4 | Response of high-fat diet-fed Gdf15−/− and Gfral−/− mice 
to metformin. a, Circulating GDF15 levels in high-fat diet-fed Gdf15+/+ and 
Gdf15−/− mice given oral dose of metformin (300 mg kg⁻¹) once daily for 11 days. 
Data are mean ± s.e.m., mice as in Fig. 2a. All Gdf15−/− samples were below lower 
limit of the assay (<2 pg ml⁻¹); P values by two-way ANOVA with Tukey’s 
correction for multiple comparisons. b, Circulating GDF15 levels in high-fat 
diet-fed Gfral+/+ and Gfral−/− mice given oral dose of metformin (300 mg kg⁻¹) 
once daily for 11 days. Data are mean ± s.e.m., mice as in Fig. 2c; P values by two- 
way ANOVA with Tukey’s correction for multiple comparisons. c, Cumulative 
food intake in high-fat diet-fed Gfral+/+ and Gfral−/− mice on a high-fat diet given 
an oral dose of metformin (300 mg kg⁻¹) once daily for 11 days. Data are 
mean ± s.e.m., mice as in Fig. 2c; no statistically significant difference in vehicle 
versus metformin by two-way ANOVA. d, Fat mass (left) and lean mass (right) in 
metformin-treated obese mice dosed with anti-GFRAL antagonist antibody 
weekly for five weeks, starting four weeks after initial metformin exposure 
(mice as in Fig. 2d). Body composition was measured using MRI after 4 weeks of 
metformin exposure and 2 weeks after receiving anti-GFRAL (week 6) and after 
multiple comparisons. 
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Extended Data Fig. 5 | Response of second, independent cohort of high-fat 
diet fed Gdf15+/+ and Gdf15−/− mice to metformin. a–c, Percentage change in
body weight (a), absolute body weight (b) and cumulative food intake (c) of
Gdf15+/+ and Gdf15−/− mice on a high-fat diet treated with metformin
(300 mg kg⁻¹ day⁻¹) for 11 days. Data are mean ± s.e.m. (n = 6 per group, except

Extended Data Fig. 6 | Glucose, insulin and GDF15 response to metformin.

a, Fasting glucose from oral GTT as in Fig. 3e, f. ANOVA; effect of antibody,
P = 0.028; effect of metformin, P = 0.271; interaction of antibody and
metformin, P = 0.707. b, Circulating GDF15 in mice undergoing intraperitoneal
GTT after a single dose of metformin as in Fig. 3k, l. P values by two-way ANOVA
with Tukey’s correction for multiple comparisons. c, d, Fasting glucose (c) and
fasting insulin (d) at time 0 of intraperitoneal GTT as in Fig. 3k, l; not
statistically significant by two-way ANOVA. e, AUC analysis of glucose levels as
in Fig. 3k, l. P values by two-way ANOVA, effect of genotype, P = 0.392;
interaction of genotype and metformin, P = 0.883. a–e, Data all mean ± s.e.m.
f, Circulating GDF15 levels in high-fat-diet-fed Gdf15+/+ mice after single oral
dose of metformin (600 mg kg⁻¹). Samples were collected 6 h after dosing, data
are mean ± s.e.m., n = 7 per group; P values (95% CI) by two-tailed t-test.

Extended Data Fig. 7 | In situ hybridization for Gdf15 mRNA expression in
gut, liver and kidney. a, Representative images from the mouse with
circulating GDF15 level closest to the group median shown in Fig. 4b, with
images from other regions of the gut and from liver. b, In situ hybridization for
Gdf15 mRNA expression (red spots) in colon. Tissue collected from
high-fat-diet-fed wild-type mice, 6 h after a single dose of oral metformin (600 mg kg⁻¹)
(right, red box, M1–M7) or vehicle gavage (left, blue box, V1–V7); n = 7 mice per
group, mice as in Fig. 4.

Extended Data Fig. 8 | Analysis of Gdf15 mRNA expression (normalized to expression levels of Actb) in tissue from high-fat diet-fed Gdf15+/+ mice. Metformin
treatment (300 mg kg⁻¹) once daily for 11 days (see Fig. 2a). Data are mean ± s.e.m., n = 6 metformin, n = 7 vehicle; P values (95% CI) by two-tailed t-test.

Extended Data Fig. 9 | Hepatic GDF15 response to biguanides. a, b, Gdf15
mRNA expression in primary mouse hepatocytes (a) or human iPS-cell-derived
hepatocytes (b) treated with vehicle control (Con) or metformin for 6 h. mRNA
expression is presented as fold expression relative to control treatment (set at
1), normalized to Hprt and GAPDH in mouse and human cells, respectively.
Data are expressed as mean ± s.e.m. from four (a) or two (b) independent
experiments. P values (95% CI) by one-way ANOVA with Tukey’s correction for
multiple comparisons. c, d, Circulating levels of GDF15 (c) and hepatic Gdf15
mRNA expression (d) (normalized to β2-microglobulin) in chow-fed, wild-type
mice 4 h after a single oral dose of phenformin (300 mg kg⁻¹). Data are
mean ± s.e.m., n = 6 per group; P values (95% CI) by two-tailed t-test.
e, Representative image of in situ hybridization for Gdf15 mRNA expression
(red spots) of fixed liver tissue derived from animals treated as described

Extended Data Fig. 10 | Role of the ISR in biguanide-induced Gdf15
expression. a, b, mRNA levels in kidney (a) and colon (b) isolated from obese
mice 24 h after a single oral dose of metformin (600 mg kg⁻¹). Data are
mean ± s.e.m. (n = 5 per group, except for n = 4 for colon metformin Slc22a1).
P values (95% CI) by two-tailed t-test. Gdf15 mRNA fold induction 24 h after
metformin (600 mg kg⁻¹) is positively correlated with Chop mRNA induction in
both kidney (a, right) and colon (b, right). Black line shows linear regression
analysis. c–g, Immunoblot analysis of ISR components (c) and Gdf15 mRNA
expression (d) in wild-type mouse embryonic fibroblasts (MEF) treated with
vehicle control (Con), metformin (Met, 2 mM) or phenformin (Phen, 5 mM) or
tunicamycin (Tn, 5 μg ml⁻¹, used as a positive control) for 6 h. e–g, Gdf15 mRNA
expression in ATF4 knockout (KO) MEFs (e), in control siRNA and CHOP siRNA
transfected wild-type MEFs treated with Tn or Phen for 6 h (f), or in wild-type
MEFs pre-treated for 1 h with either the PERK inhibitor GSK2606414 (GSK,
200 nM) or eIF2α inhibitor ISRIB (ISR, 100 nM) then co-treated with
phenformin for a further 6 h (g). mRNA expression is presented as fold
expression relative to its respective control treatment (set at 1) or
phenformin-treated samples (set as 100) with normalization to Hprt gene expression. Data
are mean ± s.e.m. from two (c, d) or at least three (e–g) independent
experiments. P values (95% CI) by two-tailed t-test relative to
phenformin-treated control wild-type and control siRNA-treated samples. h, GDF15 protein
in supernatant of mouse-derived 2D duodenal organoids treated with
metformin in the absence or presence of ISRIB (1 μM). Data are expressed as
mean ± s.e.m. from two independent experiments. At least duplicate protein
measurements for each sample. P values by two-way ANOVA with Sidak’s
correction for multiple comparisons. i, GDF15 protein in supernatants of
mouse-derived 2D duodenal organoids from wild-type and Chop-null mice
treated with metformin from two independent experiments. At least duplicate
protein measurements for each sample. Data are mean ± s.e.m.; P values
(95% CI) by two-tailed t-test.


Sample size Sample sizes were determined on the basis of homogeneity and consistency of characteristics in the selected models and were sufficient to
detect statistically significant differences in body weight, food intake and serum parameters between groups, while also ensuring no more 
animals than necessary were used. These assessments are based on extensive expertise with these models and endpoints. In vitro sample 
sizes were based on previous extensive experience with reagents and systems used. 


Data exclusions One data point of 25 food intake points collected on day 11 of mouse study 3 was lost due to technical error (see methods). One animal in 
mouse study 6 which was otherwise well and healthy did not progress to ITT after metformin due to technical and acute behavioural issues 
at time of study. One blood sample from mouse study 7 was lost due to technical error. 


Replication Acute dosing of metformin to chow fed animals has been replicated and reproduced across two laboratories. Phenformin data have been
replicated. Mouse study 1 has not been replicated exactly but the acute response to metformin in HFD fed animals has been reproduced 
across two laboratories. Longer term dosing studies of metformin to HFD Gfral and Gdf15 null mice (mouse study 3 and 4) have been 
replicated; two independent cohorts of Gdf15 null mice appear in the manuscript. ITT post metformin dosing of Gdf15 null has been done 
once. Mouse study 5 has not been replicated exactly but similar data were generated using anti-GFRAL antibody given over a different time 
scale but with similar metformin exposure. Mouse studies 7 and 8 have been done once. All in vitro and cell based experiments have been
replicated successfully and reliably reproduced with replicate numbers reported in legends. 


Randomization Animals were randomised into treatment groups based on body weight such that the mean body weights of each group were as matched as
possible but without using excess numbers of animals. 
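
One way such body-weight matching could be carried out is sketched below. This is an illustration only, not the procedure the authors report: the function name `weight_matched_groups`, the rank-and-shuffle strategy and the example weights are all assumptions. Animals are ranked by weight, cut into consecutive strata of one animal per group, and each stratum is randomly distributed across the groups, which keeps group means close while leaving the assignment random.

```python
import random
from statistics import mean


def weight_matched_groups(weights, n_groups=2, seed=None):
    """Assign animals (by index) to groups with closely matched mean weight.

    Hypothetical helper: rank animals by weight, take consecutive strata of
    n_groups animals, and randomly place one animal from each stratum in
    each group.
    """
    rng = random.Random(seed)
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(order), n_groups):
        stratum = order[start:start + n_groups]
        slots = list(range(n_groups))
        rng.shuffle(slots)  # random group order within this stratum
        for animal, group in zip(stratum, slots):
            groups[group].append(animal)
    return groups


# Made-up body weights in grams, for illustration only.
example_weights = [38.2, 41.0, 39.5, 44.3, 42.1, 40.7,
                   43.6, 37.9, 45.0, 41.8, 39.9, 42.9]
vehicle, metformin = weight_matched_groups(example_weights, seed=1)
print(mean(example_weights[i] for i in vehicle),
      mean(example_weights[i] for i in metformin))
```

Because each stratum contributes one animal to every group, the difference in group means cannot exceed the average within-stratum weight spread.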


Blinding Serum/plasma GDF15 measurements were blinded by the investigators. Investigator undertaking ISH analysis of tissue was blinded to 
treatment during tissue processing and labelling. 


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems    Methods
n/a | Involved in the study         n/a | Involved in the study
Antibodies                          ChIP-seq
Eukaryotic cell lines               Flow cytometry
Palaeontology                       MRI-based neuroimaging
Animals and other organisms
Human research participants
Clinical data


Antibodies 


Antibodies used ATF4 antibody was obtained from David Ron. CHOP was obtained from Santa Cruz (Cat# sc-7351; RRID: AB_627411). pEIF2a 
Ser51 was obtained from Abcam (Cat# ab32157; RRID: AB_732117). Calnexin was obtained from Abcam (Cat# ab75801; RRID: 
AB_1310022). Anti-GFRAL functional blocking antibody was generated by NGM. Secondary antibodies used were horseradish
peroxidase (HRP)-conjugated anti-rabbit immunoglobulin G (IgG) and HRP-conjugated anti-mouse IgG (Cell Signalling Technologies).


Validation ATF4 and CHOP antibodies have been previously validated in KO and siRNA knockdown samples (this study and PMID: 30639358).
pEIF2a antibody has been validated previously (PMID: 27297692) and we have independently confirmed this during the course of
our studies for PMID: 30639358 (data not shown). Calnexin is a well established commercially available antibody that has been
used by numerous investigators and published frequently. Characterization of anti-GFRAL functional blocking antibodies was
carried out using cell-based RET/GFRAL luciferase gene reporter assays, in vitro binding studies (ELISA and Biacore) and in vivo
studies as described in patent number US10174119B2, https://patents.google.com/patent/US10174119B2/en.




Eukaryotic cell lines 


Policy information about cell lines 


Cell line source(s) Mouse embryonic fibroblasts (MEFs) were obtained from David Ron (CIMR). The hiPSC line A1ATDR/R was obtained from Ludovic
Vallier (Cambridge Stem Cell Institute).


Authentication MEFs have been previously validated (PMID 12667446 and 12667446) and the hiPS cell line was validated (PMID 20739751 and
21993621).
Mycoplasma contamination MEFs and A1ATDR/R cells tested negative for Mycoplasma contamination.


Commonly misidentified lines No commonly misidentified lines were used.
(See ICLAC register)


Animals and other organisms


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals Mice studied ranged between 2 and 6 months of age. At NGM, animals were kept under controlled light (12-hour light and 12-hour
dark cycle, dark 6:30 pm - 6:30 am), temperature (22 ± 3°C) and humidity (50% ± 20%) conditions. In Cambridge, mice were
maintained in a 12-hour light/12-hour dark cycle (lights on 0700-1900), temperature-controlled (22°C) facility, with ad libitum
access to food (RM3(E) Expanded chow, Special Diets Services, UK) and water. Any mice bought from an outside supplier were
acclimatised in a holding room for at least one week prior to study. Study diets as outlined in relevant section of "Methods".
Age, sex and body weight of mice used as detailed in relevant "Methods" section. Unless otherwise stated, male mice were
used and animals were housed singly during studies with environmental enrichment within cages.

All drug administration and testing were performed during the light cycle.

Wild-type mice used in Cambridge (C57BL/6J) were from Charles River, Margate, UK.
C57BL/6N-Gdf15tm1a(KOMP)Wtsi/H mice were obtained from the MRC Harwell Institute, UK.

Gfral-/- mice were purchased from Taconic (#TF3754) on a mixed 129/SvEv-C57BL/6 background and backcrossed for 10 
generations to >99% C57BL/6 background at NGM’s animal facility. 


Wild animals The study did not involve wild animals 
Field-collected samples The study did not involve samples collected from the field. 
Ethics oversight At NGM, all experiments were conducted with NGM IACUC approved protocols and all relevant ethical regulations were
complied with throughout the course of the studies, including efforts to reduce the number of animals used.

In Cambridge, all mouse studies were performed in accordance with UK Home Office Legislation regulated under the Animals 
(Scientific Procedures) Act 1986 Amendment, Regulations 2012, following ethical review by the University of Cambridge Animal 
Welfare and Ethical Review Body (AWERB). 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


Human research participants 


Policy information about studies involving human research participants 


Population characteristics CAMERA was a randomized, double-blinded, placebo-controlled trial designed to investigate the effect of metformin on
surrogate markers of cardiovascular disease in patients without diabetes, aged 35 to 75, with established coronary heart disease
and a large waist circumference (≥94 cm in men, ≥80 cm in women). Mean age (years) (s.d.): metformin 63 (8), placebo 64 (8); male
sex: metformin 79 (81%), placebo 63 (72%). Baseline characteristics did not differ substantially between treatment groups.

Nine participants completed the study by Konopka and colleagues; 7 had a family history of T2DM and 8 were metformin naive. 
One participant had previously used metformin but discontinued more than 2 years before the study commenced. 


Recruitment 3000 potential participants were identified from electronic searches of Glasgow general practice databases, supplemented by 
patients from hospital cardiology clinics. Of those invited, 805 replied and 356 were screened. 173 were enrolled and randomly 
assigned (86 to metformin, 87 to placebo). Key eligibility criteria included the use of statin therapy, history of coronary heart
disease, large waist circumference (≥94 cm in men, ≥80 cm in women), and no history or biochemical evidence of type 2 diabetes.
Eligible participants were randomly assigned to metformin or placebo (1:1) with a randomisation sequence generated 
independently by computer with permuted blocks of four without stratification. 
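
For readers unfamiliar with the technique, permuted-block randomisation with blocks of four and no stratification can be sketched as follows. This is a minimal illustration under stated assumptions, not the trial's actual software: the function name `permuted_block_sequence`, the arm labels and the seed are hypothetical.

```python
import random


def permuted_block_sequence(n_participants, block_size=4,
                            arms=("metformin", "placebo"), seed=None):
    """Return a 1:1 allocation sequence built from shuffled blocks.

    Hypothetical helper: each block holds an equal number of every arm and
    is shuffled independently, so the arms stay balanced after every
    completed block.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = [arm for arm in arms for _ in range(per_arm)]
        rng.shuffle(block)  # random order within the block
        sequence.extend(block)
    return sequence[:n_participants]


# 173 participants were enrolled; the seed here is arbitrary.
allocation = permuted_block_sequence(173, seed=2009)
print(allocation.count("metformin"), allocation.count("placebo"))
```

With 173 participants the final block is incomplete, so the two arms differ in size by exactly one, which is consistent with the 86 and 87 reported above.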

In the study by Konopka and colleagues, inclusion criteria were: obesity (body mass index >30 kg/m2), sedentary (<1 hour of 
structured activity per week), nonsmoking, and not taking any medication to control blood glucose. 


Ethics oversight All participants in CAMERA study provided written informed consent and were followed up for 18 months. This study was 
approved by the Medicines and Healthcare Products Regulatory Agency and West Glasgow Research Ethics Committee, and 
done in accordance with the principles of the Declaration of Helsinki and good clinical practice guidelines. The study by Konopka 
and colleagues was approved by the Mayo Clinic Institutional Review Board and all participants provided written, informed
consent.


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


Clinical data 


Policy information about clinical studies 
All manuscripts should comply with the ICMJE guidelines for publication of clinical research and a completed CONSORT checklist must be included with all submissions. 


Clinical trial registration The CAMERA trial is registered with ClinicalTrials.gov, number NCT00723307. The study by Konopka and colleagues is registered, 
number NCT01956929. 


Study protocol Trial protocol outlined in PMID: 24622715 and supplied on submission. 


Data collection Participants were randomized 1:1 to 850mg metformin or matched placebo twice daily with meals. Follow up study visits were 
conducted between 2009 and 2012 at the Glasgow Clinical Research Centre. Participants attended six monthly visits after 
overnight fasts and before taking their morning dose of metformin. Blood samples collected during the trial were centrifuged at 
4 degrees Celsius soon after sampling, separated and stored at -80°C. Bodyweight, body fat (by bio-impedance with a Tanita BIA 
body fat analyser [Tanita Corporation, Tokyo, Japan]), waist circumference (measured midway between lowest rib and iliac 
crest), and hip circumference (measured around widest part of the buttocks) were measured at each visit. 

In the study by Konopka and colleagues, placebo or metformin (week 1, 500 mg twice daily; week 2, 1000 mg twice daily) were
administered following a six week period of washout. Samples were collected in the morning after overnight fasting.


Outcomes The primary endpoint was progression of mean distal carotid intima-media thickness (cIMT) over 18 months in the modified
intention-to-treat population. Further descriptions and results for the pre-specified outcomes for the CAMERA trial are provided 
in a previous publication PMID: 24622715. 
As outlined in a previous publication (PMID: 27160898) the study by Konopka and colleagues aimed to investigate whether
metformin inhibited glucagon-stimulated endogenous glucose production (EGP) in humans. The study measured EGP using
stable isotope methodology under basal, glucagon-deficient, and glucagon-stimulated conditions.


The solid tumour microenvironment includes nerve fibres that arise from the
peripheral nervous system. Recent work indicates that newly formed adrenergic
nerve fibres promote tumour growth, but the origin of these nerves and the
mechanism of their inception are unknown. Here, by comparing the transcriptomes
of cancer-associated trigeminal sensory neurons with those of endogenous neurons
in mouse models of oral cancer, we identified an adrenergic differentiation signature.
We show that loss of TP53 leads to adrenergic transdifferentiation of
tumour-associated sensory nerves through loss of the microRNA miR-34a. Tumour growth
was inhibited by sensory denervation or pharmacological blockade of adrenergic
receptors, but not by chemical sympathectomy of pre-existing adrenergic nerves.
A retrospective analysis of samples from oral cancer revealed that p53 status was
associated with nerve density, which was in turn associated with poor clinical
outcomes. This crosstalk between cancer cells and neurons represents a mechanism by
which tumour-associated neurons are reprogrammed towards an adrenergic
phenotype that can stimulate tumour progression, and is a potential target for
anticancer therapy.


Early in cancer development, nerve fibres form and infiltrate tumour
tissue, and the density of these nerves in solid tumours has been
associated with poor clinical outcomes. Neurogenic signals, including
adrenergic stimulation by tumour-infiltrating neural fibres, are
strongly associated with tumorigenesis, angiogenesis, invasion, and
metastasis. Hence, the little-understood molecular mechanisms
of cancer-nerve crosstalk during cancer-associated neural infiltration
represent opportunities for therapeutic intervention. The TP53
tumour suppressor gene is the most commonly mutated gene in
head and neck cancer and shapes multiple aspects of tumour formation,
including the microenvironment. p53 expression fluctuates
during nerve regeneration, permitting tight control of the plasticity
of the differentiation phase. We therefore explored the mechanisms
of cancer-nerve interactions in head and neck squamous cell
carcinoma on the basis of the hypothesis that the p53 protein
suppresses cancer-nerve interactions in this disease and that the loss
of this p53 function increases cancer-nerve crosstalk and thereby
promotes tumour progression.


Loss of p53 alters neural milieu 

To evaluate the impact of tumour innervation in head and neck cancer,
we analysed survival data from The Cancer Genome Atlas. High neural
density in oral cavity squamous cell carcinoma (OCSCC) was associated
with poorer overall survival (P < 0.0001, log-rank test) and with the
presence of TP53 mutations (P < 0.0001, Extended Data Fig. 1a–c).
To study the interplay between nerves and epithelial cell p53 function
throughout tumorigenesis, we evaluated neuritogenesis over the
course of the progression from precursor lesions to high-grade lesions
in Trp53flox/flox wild-type control and Krt5CreTrp53flox/flox (Trp53−/− in the
recombined epithelial tissue) mouse models of oral cancer (Fig. 1a).
Nerve density was significantly greater in tumours from mice lacking

Fig. 1 | Loss of p53 alters the neural microenvironment throughout
tumour evolution. a, Tumour progression in Krt5.Cre;B6.129P2-Trp53tm1Brn/J
(Krt5CreTrp53flox/flox) and B6.129P2-Trp53tm1Brn/J (Trp53flox/flox) control mice.
b, Quantification of neural density in tongues from Trp53flox/flox and
Krt5CreTrp53flox/flox mice immediately after the end of treatment with the
carcinogen 4-nitroquinoline 1-oxide (4NQO) (normal mucosa), 20 weeks after
treatment completion (low-grade lesions), and 30 weeks after treatment
completion (high-grade lesions) (n = 8 except for control group at 20 weeks
(n = 6)). c, Representative immunofluorescence staining of DRGs co-cultured
with p53WT or p53null PCI-13 cells; data independently replicated in 12 ganglia.
d, Quantification of neuritogenesis in DRGs co-cultured with p53-isogenic
PCI-13 cells or normal oral keratinocytes (n = 6 biologically independent ganglia per
cell line). e, Analysis of neural density in orthotopic p53-isogenic PCI-13
xenograft cells (n = 8 mice per group). f, Representative immunofluorescence
of in vitro neuritogenesis in DRGs treated with soluble factors (EV-depleted
conditioned medium) or EVs from p53-isogenic PCI-13 cells; data
independently replicated in 20 ganglia. g, In vitro quantification of
neuritogenesis (n = 4 biologically independent ganglia per condition).
h, Quantification of neuritogenesis in freshly collected DRG cultured with
conditioned medium from HN31 RAB27A/RAB27B-isogenic human OCSCC cells
(n = 8 and n = 5 biologically independent ganglia for HN31 RAB27A−/−RAB27B−/−
and HN31 RAB27A+/+RAB27B+/+ cells respectively). i, Representative
immunofluorescence montage of glossectomy specimens derived from mice 3
weeks after injection of HN31 RAB27A/RAB27B-isogenic OCSCC cells; data
independently replicated in 10 mice. j, In vivo analyses of neural density (n = 5
mice per group). Unpaired two-tailed t-test (h, j) and one-way ANOVA with
Tukey multiple comparisons (b, d, e, g). Bar graphs represent mean ± s.e.m.


p53 expression in oral epithelia compared with control mice expressing
p53 (Fig. 1b), in both early and advanced phases of cancer development.
Similarly, in orthotopic xenografts of human OCSCC cells, we found
increased nerve density in p53-deficient (p53null) tumours compared
with wild-type (p53WT) controls (Extended Data Fig. 1d–f). These findings
suggest that the loss of p53 signalling in epithelial tumour cells is
associated with neuritogenesis during early tumorigenesis.

To better understand the mechanisms that link epithelium-derived
signalling, altered p53 function, and neuritogenesis, we assessed
induced neurite outgrowth from dorsal root ganglia (DRG) co-cultured
with oral keratinocytes and p53-isogenic OCSCC cells. Neurite outgrowth
was markedly enhanced in the presence of p53null cells compared
with p53WT-expressing cells (Fig. 1c, d). The p53°”" mutation (which
causes a functional p53 deficiency) was also associated with increased
neuritogenesis, as was the p53°** mutation (which leads to structural
p53 deficiency); however, the partial effect of p53°*> on the DNA-binding
region (which causes partial functional p53 deficiency) only
modestly increased neuritogenesis. The increased neuritogenesis
seen in tumours with p53null or either DNA-contact or structural mutations
in TP53, both in vitro and in vivo (Fig. 1e, Extended Data Fig. 1g–l),
suggested that the effect of p53 mutations on neuritogenesis is likely
to result from loss of p53 function.


Cancer-derived vesicles control neuritogenesis 


To investigate the route by which neurotrophic cues are delivered to
nerves, we incubated DRGs with cancer-derived soluble and extracellular
vesicle (EV) compartments of conditioned medium from p53-isogenic
OCSCC cell cultures (Extended Data Fig. 2a–g). DRGs cultured
with EVs derived from p53null cells had more neurofilaments than those
cultured with the corresponding EV-depleted conditioned medium or
with EVs from p53WT cells (Fig. 1f, g).

To assess the potential effect of EVs on neuritogenesis, we used conditioned
medium derived from HN31 cells (a human OCSCC cell line
with endogenous p53“ and p53“"5 mutations that cause functional
deficiency) and isogenic HN31 RAB27A−/−RAB27B−/− cells lacking the
GTPases RAB27A and RAB27B, which are necessary for the exocytosis
of EVs (Extended Data Fig. 2h, i). Knockout of RAB27A and RAB27B
significantly reduced DRG neuritogenesis in vitro (Fig. 1h, Extended
Data Fig. 2j, k) and in vivo in orthotopic xenografts of OCSCC cells
(Fig. 1i, j). Thus, neurons appear to be a major stromal target of EVs
derived from p53-deficient cancer cells.


Cancer-derived miRNAs drive axonogenic switch 


Small RNAs are key regulators of neuronal development, regeneration,
and function, and EVs are the main route by which RNA species are
transferred between cells (Extended Data Fig. 3a, b). By comparing the
microRNA (miRNA) profiles of EVs derived from p53-sufficient and
isogenic p53-deficient cell lines, we identified p53-related alterations
in the expression of 17 miRNAs (Extended Data Fig. 3c–e). A customized
single-channel Agilent array revealed a similar significant increase
in the expression of miR-34a-5p and miR-141-5p, but not the other
15 miRNAs, in EVs derived from p53WT cells compared with those from
p53null cells (GEO accession number GSE140324). Analysis of miR-34a
and miR-141 in tongue tumour specimens from Trp53flox/flox control mice
and Krt5CreTrp53flox/flox mice confirmed a significant decrease in the levels
of both miRNAs in Trp53−/− tumours (Extended Data Fig. 3f).

To investigate how miR-34a and miR-141 interact with other cancer-derived
EV signals to mediate axonogenesis, we co-cultured trigeminal
ganglion (TG) neurons transfected with antagomiR-34a or antagomiR-141
(Extended Data Fig. 3g) with EVs containing abundant miR-34a
and miR-141 derived from p53WT OCSCC cells. Compared with non-specific
antagomiR or no-EV controls, both antagomiRs increased
the number of neurofilaments, although only antagomiR-34a had
a significant effect (Fig. 2a, b). Conversely, transfection of TG neurons
with ectopic miR-34a (compared with scramble miRNA or no-EV
controls), followed by co-culture with miR-34a-deficient EVs derived

Fig. 2 | p53-dependent alterations in miRNA populations control
neuritogenesis. a, Representative fluorescence-bright-field overlay images of
TG neurons after transfection with antagomiR-34a and incubation with EVs
from p53WT PCI-13 cells (left) and after transfection with miR-34a mimic and
incubation with EVs from p53null PCI-13 cells (right); data independently
replicated in 12 wells. b, Quantification of neuritogenesis 72 h after neuron-EV
co-culture. EV-free medium (grey bars) or non-specific antagomiR/miR-mimic
transfection were used as controls (n = 3 biologically independent samples per
condition). c, Quantification of neuritogenesis in TG neurons 72 h after
neuron-EV co-culture using EVs from p53WT OCSCC cells stably expressing a
short hairpin targeting miR-34a (shmiR-34a) or non-specific control (n = 4
biologically independent samples per condition). d, Representative images
showing β3-tubulin-positive neural fibres in orthotopic OCSCC xenografts
stably expressing shmiR-34a (n = 8 mice) or non-specific control (n = 7 mice);
data independently replicated in 15 mice. e, Quantification of neural density
3 weeks after orthotopic injection as in d. Bar graphs represent mean ± s.e.m.
Unpaired two-tailed t-test (c, e) or one-way ANOVA with Tukey multiple
comparisons (b).


from p53null OCSCC cells, significantly decreased the number of neurofilaments
(Fig. 2a, b, Extended Data Fig. 3h). Co-culture of TG neurons
with EVs derived from miR-34a-deficient p53WT OCSCC cells (Extended
Data Fig. 3i–k) significantly increased the number of neurofilaments
in vitro compared with no EVs or EVs derived from the same cells without
knockdown (Fig. 2c). Knockdown of miR-34a in p53WT OCSCC cells
also increased neuritogenesis in vivo, compared with the same cells
without knockdown (Fig. 2d, e). Together, these data support the notion
that EV shuttling of OCSCC-derived miR-34a to cancer-associated
neurons negatively regulates, and confers resistance to, EV-derived
axonogenic signals.

To dissect the relative contributions to axonogenesis of stimulatory
and inhibitory signals in cancer-derived EVs, we used antagomiRs to
selectively inhibit 13 miRNAs detected by our EV miRNA sequencing
that have previously been shown to induce neural growth or cancerous
neural invasion. Inhibition of miR-21, miR-197, or miR-324, but
not of the other miRNAs, reduced neurite outgrowth in TG neurons
co-cultured with EVs derived from miR-34a-deficient p53null OCSCCs
(Extended Data Fig. 4a). Transfection of TG neurons with miR-21, miR-197,
or miR-324 alone only modestly increased neuritogenesis; however,
co-transfection with both miR-21 and miR-324, with or without miR-197,
increased neuritogenesis twofold (Extended Data Fig. 4b, c). Incubation
of TG neurons with liposomes containing miR-21, miR-324, and scramble
miRNA increased axonogenesis compared with liposomes containing
miR-34a, miR-21 and miR-324 together (Extended Data Fig. 4d–f).
These data are consistent with aberrant, but orchestrated, signalling
by miRNAs in OCSCC-derived EVs that both negatively regulates and
acts as a ligand that modulates the tumour neural microenvironment.


EVs drive sensory nerve reprogramming 


To evaluate the effect of OCSCC-derived EVs on neuronal transcrip- 
tional programs, we performed RNA sequencing of human DRG neurons 
co-cultured with EVs from p53™" or p53" OCSCC cells®. Principal com- 
ponent analysis of RNA sequencing data showed that neurons cultured 
with EVs from p53“ OCSCC cells were segregated from those cultured 
with EVs from p53™" OCSCC cells (Fig. 3a). Ingenuity pathway analysis 
revealed that the differentially expressed genes in the latter group of 
neurons were enriched for neuronal outgrowth and morphogenesis, 
synaptogenesis, differentiation and stemness, and synaptic trans- 
mission (Fig. 3b). To evaluate the effect of p53 on cancer-associated 
neuronal differentiation, we analysed nerve fibre densities in samples 
of tissue removed during glossectomy (TP53*", n=12; TP53™, n=12)
from patients with OCSCC treated at MD Anderson Cancer Center. 
Assessment of the sympathetic and parasympathetic branches of the 
autonomic nervous system showed that fibres positive for tyrosine 
hydroxylase (TH, adrenergic), but not those positive for vesicular ace- 
tylcholine transporter (parasympathetic), were significantly denser 
in TP53™ OCSCCs than in TP53’ tumours (Fig. 3c, d, Extended Data 
Fig. 5a-e). Similarly, in vivo, TH+ nerve fibres were significantly denser
in tongue tumour specimens from Trp53” tumours excised from
Krt5“Trp53"°"™ mice than in those from Trp53”"* tumours (wild-type 
Trp53) excised from Trp53”°“™ control mice (Fig. 3e). Together these 
results suggest that the recruitment of proximal adrenergic neurons 
is related to signals that originate from p53-deficient tumours. 

Next, we investigated whether shuttling of tumour-derived miRNAs 
in EVs regulates the differentiation of cancer-associated neurons. Neu- 
ritogenesis and noradrenaline release were increased in human DRG 
or mouse TG sensory neurons 72 h after incubation with EVs derived 
from p53™" OCSCC cells, but not from p53" OCSCC cells (Fig. 3f-h, 
Extended Data Fig. 5f-k). Next, we tested whether the adrenergic neu- 
rogenic response of cancer-associated neurons to epithelial loss of 
p53 could be rescued by EVs derived from p53" OCSCC cells in vivo. 
Daily intratumoral injections of EVs from p53™' OCSCC cells markedly 
inhibited noradrenaline secretion and TH expression in orthotopic 
xenografts of p53™ OCSCC cells, compared with tumours injected with 
p53™" EVs and controls (Fig. 3i-k, Extended Data Fig. 5l). These results
provide evidence that EVs derived from p53" OCSCC cells suppress 
neo-adrenergic cancer-associated neurogenesis. 

miR-34a restricts cell fate and impedes somatic cell reprogramming, 
whereas miR-34a deficiency expands cell developmental potential.
We tested the ability of miR-34a to suppress the tumour-associated
phenotypic switch. Both TH expression and noradrenaline levels (Fig. 3l,
m) were higher in TG neurons incubated with EVs purified from p53" 
OCSCC cells in which miR-34a was knocked down than in control TG 
neurons incubated with EVs from p53" OCSCC cells in which miR-34a 
was not knocked down. Furthermore, incubation of TG neurons with 
liposomes containing miR-21, miR-324, and scramble miRNA, but not 
with liposomes containing miR-21, miR-324, and miR-34a, resulted in 
robust noradrenaline synthesis (Extended Data Fig. 5m). 

Next we investigated the transcriptomes of p53-deficient cancer- 
associated neurons. We identified 2,495 genes that were upregulated 
and 1,760 that were downregulated in neurons incubated with EVs 
from p53™" OCSCC cells, compared with neurons incubated with EVs 
from p53" OCSCC cells (Extended Data Fig. 5n; GEO accession number 
GSE140189). Upregulated genes were associated with neuronal sur- 
vival, development, growth, and branching; downregulated genes were 
associated with neuronal function and synaptic transmission (Gene 
Ontology terms). Neurons incubated with EVs from p53™" OCSCC
cells had high expression of catecholamine biosynthesis-related genes
(Extended Data Fig. 6a) and low expression of endogenous sensory
neuron pain signalling genes including Ntrk2, Tac1 and Plcg1; potassium
channel family genes; and glutamate metabotropic receptor genes 
(Prkc and Pka). These analyses suggest that the TG sensory neurons 
Fig. 3 | p53-deficient tumours are enriched with adrenergic nerve fibres.
a, b, Heat map (a) and enriched Gene Ontology terms (b) for differentially
expressed genes in freshly collected human DRG neurons incubated with EVs
from p53™' or p53™" PCI-13 cells, plotted by fold enrichment with the
associated log P value (Fisher’s exact algorithm for functional gene set
enrichment); n=3 biologically independent samples per condition.
c, Representative images showing TH+ adrenergic neural fibres in TP53”’ or
TP53™* human OCSCC tissue; data independently replicated in 24 patient
specimens. d, Quantification of TH+ areas as in c (n=12 independent
samples per group). e, Quantification of TH+ neural fibres in tumours from
Krt5“Trp53"'« and control mice (n=5). f, Representative images of
neo-neurites (β3-tubulin+) in human DRG co-cultured with EVs from p53™' or p53™"


acquired transcriptional programs similar to those of sympathetic 
neurons. 

To better understand the functional annotations of the involved 
miRNAs, we compared the transcriptional profiles of TG neurons 
transfected with miR-21, miR-34a, or miR-324 with that of TG neurons 
transfected with scramble miRNAs (Extended Data Fig. 6b). Neural 
identity determination domains (Gene Ontology terms) were signifi- 
cantly enriched in transcription factors crucial for neural cell fate and 
catecholaminergic differentiation, including En1, Lrp6, Ryk, Shh, Fzd3, 
Erbb2, and Wnt5a (induced by miR-21), and Nrp2, Gdnf, and Sema3f
(induced by miR-324). Two determinants of sensory neuron differentiation,
Ntrk1 and Isl1, were downregulated by miR-21 and miR-324, respectively.
Expression of genes involved in fundamental inhibitory neuron
differentiation pathways, including Eya1, Sox9, Homez, Cic, Kdm5a,
and Eif4ebp3, was dysregulated by miR-34a. Neural growth domains 
were enriched in genes associated with neurogenesis, axonogenesis, 
and neuron projection development pathways. The axon guidance 
genes, most of which were deregulated by miR-21, included members 
of the neural cell adhesion family (Cdh2, Nrcam, Lamc1, and Nfasc),
semaphorins (Plxna1, Plxna2, Sema3b, Sema3d, and Sema3e), ephrins 
(Efnb2, Epha3, Ephb4, and Epha4), Rho GTPases (Srgap2), and Napa, 
Slit3, and Slitrk6. The top pathways annotated in the neuron function 
domain were mostly related to synaptic functions, ion channels, and 
neurotransmission. Glutamatergic signalling genes (for example, 
Cnih2, Ntrk2, Oprm1, and Syt1) and nociception genes (Ptgs1, Tac1, and
Npy1r) were also downregulated by miR-324, whereas key components
of catecholaminergic neuron maintenance, including Nr4a2, En1, Lrp6,
and Ryk, were upregulated by miR-21. 

To investigate the origin of the TH+ neo-nerves that infiltrated
OCSCC xenografts, we ablated adrenergic nerves by injecting mice
with 6-hydroxydopamine (6-OHDA) before inoculating them with
tumour cells (Fig. 4a). Quantification of TH+ fibres showed that there
PCI-13 cells; data independently replicated in 8 wells. g, h, Quantification of
neuritogenesis (g) and noradrenaline levels (h) in human DRGs as in f (n=4
biologically independent samples per condition). i-k, PCI-13-p53™" orthotopic
tumours were injected daily with no EVs (vehicle) or with EVs from p53™" or
p53™' PCI-13 cells for 3 weeks (n=5 mice per group, i), and TH+ neural areas (j)
and noradrenaline levels (k) in the tongue were measured. l, m, TG neurons
were co-cultured with EVs from p53" OCSCC cells treated with lentiviral miR-34a
or non-specific inhibitors. Quantification of TH+ TG neurons (n=4 mice per
condition, l), and noradrenaline levels (n=3 biologically independent samples
per condition, m). Bars represent mean ± s.e.m. Unpaired two-tailed t-test
(d, e, g, h, l, m) or one-way ANOVA with Tukey multiple comparisons (j, k).


were similar increases in the numbers of cancer-associated TH+ fibres in
both 6-OHDA-treated mice and controls (Fig. 4b, c). In 6-OHDA-treated
mice, the levels of noradrenaline were significantly higher in p53"™"
tumours than in p53“ tumours (P < 0.001 for each, pairwise comparisons,
Fig. 4d). Flow cytometry analysis of TG neurons in 6-OHDA-treated
mice showed a substantial increase in TH+ neurons after implantation
of p53™" tumour cells (Extended Data Fig. 7a). To investigate whether 
differentiation signals modify the neural signalling components in 
the tumour microenvironment, we surgically cut the lingual nerve, a 
branch of the trigeminal nerve, thereby ablating sensory innervation 
to the tongue, before orthotopic injection of tumour cells into BALB/c 
(nu/nu) mice. Surgical lingual denervation markedly decreased the 
cancer-associated TH+ area (P < 0.0001, Fig. 4e, f), the number of TH+
ipsilateral TG neurons (Extended Data Fig. 7b, c), and tumour devel- 
opment (Fig. 4g) relative to those in sham-operated mice. However, 
chemical sympathectomy before orthotopic xenograft implantation 
of OCSCC cells did not affect tumour growth (Fig. 4h). These results 
indicate that tumour-derived signals regulate the adrenergic differ- 
entiation of cancer-associated nerves and that these neo-adrenergic 
nerves, rather than infiltration of pre-existing adrenergic nerves, 
promote tumour growth. 

In nude mice orthotopically xenografted with p53™" OCSCC cells, 
the number of TH+ cell bodies was markedly increased compared with
tumour-free mice or mice orthotopically xenografted with p53" 
OCSCC cells (Extended Data Fig. 7d, e). Furthermore, inoculation of 
mice with p53" OCSCC cells in which miR-34a was knocked down 
increased TH expression and noradrenaline secretion; these increases, 
along with tumour volume, were abrogated by lingual denervation 
(Extended Data Fig. 7f-i). We further explored the cancer-induced 
retrograde reprogramming of TG neurons in vivo. We identified 685 
genes that were upregulated and 15 that were downregulated in TG 
neurons from mice orthotopically injected with p53™" OCSCC cells, 


[Fig. 4 image: panels a-h. Recoverable legend entries: perivascular TH+ fibres; cancer-associated TH+ fibres; no tumour; sham surgery; lingual denervation; vehicle; 6-OHDA. Growth curves in g and h are plotted against days after inoculation.]


Fig. 4 | De novo transdifferentiated cancer-associated adrenergic nerves
support tumour growth. a, BALB/c (nu/nu) mice were chemically
sympathectomized by intraperitoneal injection of 6-OHDA and then injected
orthotopically with p53™" PCI-13 cells. Tumour volume was monitored for 3
weeks. b, Illustrative immunohistochemical analysis for TH+ neural fibres in
tongues from mice with or without tumours and with or without
sympathectomy. Blue arrowheads, pre-existing TH+ perivascular neural fibres;
red arrowheads, non-perivascular TH+ fibres that emerged after tumour
formation (cancer-associated nerves); data independently replicated in 32
mice. c, Perivascular and cancer-associated TH+ area (n=8 mice per condition).
d, Noradrenaline levels in tumour-bearing tongue (ipsilateral to tumour
injection site), adjacent normal tongue (contralateral to tumour injection site),
and normal tongue controls (n=3). e, BALB/c (nu/nu) mice were orthotopically
xenografted with p53™" PCI-13 cells to the denervated or sham-operated
tongue. f, Quantification of TH+ nerve fibres in mice from e (n=8 mice per
condition). g, h, In vivo p53™" PCI-13 tumour growth after lingual denervation
(blue, n=12) compared with sham surgery controls (n=10, g), or chemical
sympathectomy (red, n=12) compared with vehicle controls (n=11, h). Bar
graphs and tumour growth curves represent mean ± s.e.m. Unpaired two-tailed
t-test (g, h) or one-way ANOVA with Tukey multiple comparisons (c, d, f).


compared with mice injected with p53“" OCSCC cells (Extended Data 
Fig. 8a). Ingenuity pathway analysis indicated significant activation of 
embryonic stem cell pluripotency canonical pathways (Extended Data 
Fig. 8a). We also found statistically significant activation of adrenergic 
signalling pathways, axonogenesis, neurite branching, and ephrin 
axonal guidance signalling. 

To elucidate the reprogramming of neuronal identity”*”’, we profiled 
the expression of known neuronal lineage differentiation transcription 
factors (Fig. 5a, Extended Data Fig. 8b-e) in mouse TG after orthotopic 
inoculation of mice with p53-deficient or p53-sufficient OCSCC cell 
lines. POU5F1 and KLF4 not only have essential roles in differentiation
Fig. 5 | Retrograde signalling by p53-deficient but not p53-sufficient cancer
cells activates neural reprogramming. a, Representative images
demonstrating expression of POU5F1, KLF4, and ASCL1 in ipsilateral TGs after
orthotopic injection of p53" (upper) or p53™" (lower) PCI-13 cells; data
independently replicated in six mice. b, Kaplan-Meier curves showing the
overall survival of patients with high and low TH+ adrenergic nerve densities.
Two-sided log-rank test.


but also are validated direct targets of miR-34a, alongside strong
candidate targets such as NEUROG2 and ASCL1, which are crucial for
the determination of neuronal identity. The transcription factors
POU5F1 and KLF4, which are sufficient to reprogram mouse adult neural
stem cells, showed increased expression in mouse TG neurons
after inoculation with p53™" OCSCC cells (Fig. 5a). In line with the
neo-neuron adrenergic phenotype, the number of neurons expressing
ASCL1, which is essential for proper development of the sympathetic
nervous system, was elevated in mouse TG after inoculation with
p53™" but not p53" OCSCC cells (Fig. 5a). 


Adrenergic innervation promotes OCSCC growth 

To evaluate the effect of selective adrenergic receptor blockade on 
OCSCC progression, we orthotopically injected p53-deficient OCSCC 
cells into BALB/c (nu/nu) mice treated with carvedilol, a non-selective
blocker of β1, β2, and α1 adrenergic receptors. Tumours from carvedilol-treated
mice exhibited lower growth rates and proliferation (Ki-67+)
indices than did tumours from vehicle-treated mice, with similar car- 
diovascular haemodynamics (Extended Data Fig. 8f-h, Supplementary 
Table 1). These data support the notion that adrenergic neuron-derived 
signals have an important role in OCSCC progression. In a validation 
set of patients treated at MD Anderson Cancer Center (n = 70, Supplementary
Table 2), Kaplan-Meier analysis revealed that increased TH+
nerve density (Fig. 5b, Extended Data Fig. 8i) in OCSCCs was associated
with lower recurrence-free survival rates (P = 0.00103, log-rank
test) and lower overall survival rates (P < 0.0001, log-rank test). The
statistical significance of the association with adrenergic nerve fibre 
densities was sustained in multivariable analysis after adjustment for 
clinical variables, including age, sex, pathologic stage, surgical mar- 
gin status, perineural invasion presence, and treatment modalities 
(Supplementary Table 3). These results suggest that nerve density 
assessment merits exploration as an independent predictive marker 
of oral cancer aggressiveness. 
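
The survival comparison above rests on Kaplan-Meier estimates. As a minimal sketch of how such a curve is computed, assuming a hypothetical helper `kaplan_meier` and made-up follow-up times (none of the values below come from the study cohort), the product-limit estimator can be written as:

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times  -- follow-up time for each patient
    events -- 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical follow-up data (months), for illustration only.
example_curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
```

Censored patients reduce the number at risk without producing a step in the curve, which is why the estimate only changes at observed event times. Comparing two such curves (high versus low TH+ nerve density) would additionally require a log-rank test, which is not sketched here.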


Discussion 

Neural regulation represents an emerging targetable pathway for the 
treatment of cancer. The peripheral adrenergic nervous system
has previously been shown to regulate prostate cancer tumorigenesis.
Sympathetic nerves form a dominant part of the normal
prostate microenvironment, while in the oral cavity, their presence is 
modest and limited to the perivascular space. Our present study reveals 
that the emergence of adrenergic neo-nerves in the tumour microenvironment
accompanies the initial phase of OCSCC development.
In contrast to previous findings using a prostate cancer mouse model, in
our OCSCC mouse model, ablation of the sympathetic nervous system
before tumour inoculation neither abrogated the development of
adrenergic neo-nerves nor inhibited tumour growth.

Neither the origin of adrenergic neo-nerves nor the cellular and 
molecular events that control their development throughout tumo- 
rigenesis have previously been characterized. A recent study did show 
that doublecortin-expressing neural progenitors from the central 
nervous system infiltrate prostate tumours and metastases. Here,
we have identified crosstalk between the peripheral nervous system 
and head and neck tumours and described a phenotypic switch, induced 
by cancer cells, in which sensory nerves differentiate into adrenergic 
neo-neurons. Our findings show that in p53-deficient tumours, an 
miRNA-based mechanism mediates neuronal responses to environ- 
mental cues and determines the fate of cancer-associated neurons. 
We have shown that axonal sprouting and autonomic reprogramming 
of existing nerves occur as a result of miRNA shuttling from cancer cells
to neurons. These miRNAs orchestrate gene expression via combined 
dominantly negative (for example, miR-34a) and positive (for example, 
miR-21 and miR-324) effects, activating transcriptional programs that 
establish neuronal identity. In our mouse model of OCSCC, surgical 
ablation of sensory nerves prevented the development of these adrenergic
neo-nerves. Our results thus show that the peripheral sensory
nerves may be reprogrammed during the development of cancer in a
manner similar to that of neural progenitors that initiate adrenergic 
neurogenesis during tumour formation. 

As tumours evolve, neo-neural networks develop in and around the 
tumour stroma, providing signals that coordinate cancer progression.
These results are consistent with recent preclinical data suggesting 
that sympathetic fibres accumulate in the normal vicinity of solid 
tumour tissues and infiltrate into the stroma. Furthermore, clinical 
data show that cancer patients treated with β-blockade have improved
survival, supporting the role of adrenergic nerve activity in cancer
progression. Although further studies will be required to dissect the
molecular events that link tumour-associated neuritogenesis to cancer 
progression, our data raise the tantalizing possibility that drugs that 
target both axonal growth and the adrenergic nervous system could 
be useful for the treatment of head and neck cancer. 
Methods


Animals and in vivo procedures 

B6.129P2-Trp53™ 4 (Trp53/°"*) and BALB/c nu/nu (B6.Cg-FoxnI"™*”) 
mice were obtained from The Jackson Laboratory. Krt5“° mice were 
obtained from Dr. Carlos Caulin*®. All animal studies were carried 
out according to protocols approved by The University of Texas MD 
Anderson Cancer Center Institutional Animal Care and Use Commit- 
tee. Mouse housing, husbandry, and care practices met or exceeded 
the minimum requirements set forth in the Animal Welfare Act and the 
Guide for the Care and Use of Laboratory Animals (8th Edition). Disease 
development and progression were closely monitored, and mice with 
metastases were euthanized as soon as we noticed signs of discomfort 
in the mice or when the largest dimension of a tumour reached 5 mm,
according to our approved protocol. In none of the experiments were
these limits exceeded. 

Orthotopic human tumours were implanted by injection of 5 × 10^4
PCI-13 cells suspended in 30 μl serum-free medium (see Supplementary
Methods) into the lateral tongue of 6- to 8-week-old BALB/c nu/nu mice
and were monitored three times per week as previously described.
For the transgenic models, we crossbred Trp53”°"™ and Krt5“ mice 
to generate male and female Krt5“°Trp53”°""* mice; the recombined 
epithelial tissue of these mice lacks p53. We added 4NQO (100 μg/ml)
to the drinking water (1% sucrose) of Krt5“°Trp53°"™ mice and
Trp5F° controls*®*o", After 8 weeks of 4NQO treatment, mice were 
killed at 0, 20, and 30 weeks to study normal, precursor, and malignant 
lesion innervation, respectively. All animals underwent a full oral cavity
examination three times per week and were killed for tissue retrieval 
4 weeks after tumour cell injection. 

For the assessment of in vivo neural recruitment, mice were anaes- 
thetized and prepared: the right chorda-lingual nerve was exposed in 
the neck and transected between the anterior belly of the digastric and 
masseter muscles. Although the proximal and distal stumps of the
transected chorda-lingual nerve were separated, we resected a 5-mm
section of each stump to minimize regeneration. 


CRISPR-Cas9 knockout 

For generation of RAB27A and RAB27B knockout cells, two synthetic 
single-guide RNAs (sgRNAs) in complex with Cas9, targeting the pro- 
tein coding sequence of either RAB27A (guide sequence: GTC GTT 
AAG CTA CGA AAC CT, exon 5) or RAB27B (guide sequence: TGA ACG 
GCA AGC TCG GGA AC, exon 5), were transfected by electroporation. 
sgRNAs and Cas9 2NLS Nuclease were purchased from Synthego, and 
electroporation was performed using the Cell Line Nucleofector Kit V 
(Lonza). A total of 100,000 cells were diluted in 50 μl electroporation
buffer containing 3.6 μM of each sgRNA and 0.8 μM Cas9 enzyme, and
electroporated using the program P-020 (Amaxa Nucleofector II). Cells
were transferred to full medium immediately after electroporation and 
left to recover for 1 week, and then single-cell colonies were generated 
by seeding of a single cell using flow cytometry (sorting for live cells 
by propidium iodide staining) at the South Campus Flow Cytometry 
and Cellular Imaging Core Facility at MD Anderson Cancer Center. 
Screening for knockout clones was done by Sanger sequencing of target 
regions and western blotting. 
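
The guide sequences above are printed in spaced triplets. A small sketch of how they might be normalized and sanity-checked before ordering or alignment follows; the helper names `normalize_guide` and `gc_fraction` are illustrative, and the expectation of a 20-nt protospacer is a general assumption about Cas9 guides rather than something stated in this protocol.

```python
# Guide sequences as printed in the text (spaced triplets).
GUIDES = {
    "RAB27A": "GTC GTT AAG CTA CGA AAC CT",
    "RAB27B": "TGA ACG GCA AGC TCG GGA AC",
}


def normalize_guide(spaced_sequence):
    """Remove whitespace, upper-case, and validate the DNA alphabet."""
    seq = "".join(spaced_sequence.split()).upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters in guide: {sorted(invalid)}")
    return seq


def gc_fraction(seq):
    """Fraction of G or C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


normalized = {gene: normalize_guide(s) for gene, s in GUIDES.items()}
for gene, seq in normalized.items():
    print(gene, seq, len(seq), round(gc_fraction(seq), 2))
```

Checking length and GC content catches transcription errors in the printed sequence; it says nothing about on-target efficiency or off-target risk.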


Isolation, quantification, and characterization of EVs 

Conditioned media were collected, and EVs were isolated by differ- 
ential centrifugation and analysed using NanoSight, as previously 
described. In brief, conditioned media were centrifuged at 300g
for 10 min to eliminate cells and at 2,000g and 10,000g to eliminate 
dead cells and cell debris, respectively. Then, EVs were pelleted by 
ultracentrifugation at 120,000g for 70 min and subsequently washed 
with PBS at a similar speed. The number and size of EVs were determined
as previously described using NanoSight analysis. In brief, EVs
were analysed using a NanoSight LM10 Nanoparticle Characterization 


system. All nanoparticle tracking analyses were carried out with iden- 
tical experiment settings. Particles were measured for 60 s, and for 
optimal results, microvesicle concentrations were adjusted to obtain 
about 50 microvesicles per field of view. EV morphology was assessed 
by transmission electron microscopy as previously described.

For transmission electron microscopy, samples were placed on 
100 mesh carbon-coated, formvar-coated copper grids treated with 
poly-L-lysine for approximately 1h. Samples were then negatively 
stained with Millipore-filtered aqueous 1% uranyl acetate for 1 min. 
Stains were blotted dry from the grids with filter paper, and samples 
were allowed to dry. Samples were then examined in a JEM 1010 transmission
electron microscope (JEOL, USA, Inc.) at an accelerating voltage
of 80 kV. Digital images were obtained using the AMT Imaging System
(Advanced Microscopy Techniques Corp.). 

EVs were lysed in a 2% sodium dodecyl sulphate buffer, and equal 
amounts of protein were loaded onto a sodium dodecyl sulphate-polyacrylamide
gel and transferred onto polyvinylidene difluoride
membranes (Bio-Rad Laboratories). Antibodies against CD63 were 
used as primary antibodies. As secondary antibodies, horserad- 
ish peroxidase-linked antibodies against rabbit immunoglobulin G 
(GE Healthcare) were used at a dilution of 1:5,000. Bound antibodies 
were visualized by chemiluminescence. 


Neuron mRNA sequencing and data analysis 
Low-input RNA libraries compatible with Illumina were prepared using 
the Smart-Seq V4 Ultra Low Input RNA (Clontech) and KAPA HyperPlus 
Library Preparation kits. In brief, full-length, double-stranded cDNA 
was generated from 10 ng total RNA using Clontech’s SMART (Switch- 
ing Mechanism at 5’ End of RNA Template) technology. The full-length 
double-stranded cDNA was amplified by eight cycles of long-distance 
PCR, then purified using AMPure Beads (Agencourt). Following bead 
elution, the cDNA was evaluated for size distribution and quantity 
using the 4200 TapeStation High Sensitivity DNA Kit (Agilent Technolo- 
gies) and the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific), 
respectively. The cDNA was enzymatically fragmented, and 5 ng of the 
fragmented cDNA was used to generate Illumina-compatible libraries
using the KAPA HyperPlus Library Preparation kit. The KAPA libraries 
were purified and enriched with eight cycles of PCR to create the final 
cDNA library. The libraries were quantified using the Qubit dsDNA HS
Assay (Thermo Fisher Scientific), then multiplexed with 12-16 libraries 
per pool. The pooled libraries were quantified by quantitative PCR using 
the KAPA Library Quantification Kit (KAPA Biosystems) and assessed for 
size distribution using the 4200 TapeStation (Agilent Technologies). 
The libraries were then sequenced, one pool per lane, on the Illumina 
HiSeq4000 sequencer using the 76-bp paired-end format. 
Paired-end reads in FASTQ format were initially checked for read
quality using FastQC and then aligned to the reference genome, UCSC
mouse mm10 or GENCODE human GRCh38, using TopHat2. Alignment
quality was evaluated from the output of TopHat2. The BAM file with
mapped reads for each sample was sorted using SAMtools and then
used as input for HTSeq to estimate the number of reads that
are mapped to each gene. The read counts for all samples were then
normalized using the trimmed mean of M-values (TMM) method implemented in
the R Bioconductor package edgeR. Weakly expressed genes were
excluded if they did not have more than one read per million in at least
two samples (adjusting for library size for each sample). Principal com- 
ponent analysis and hierarchical cluster analysis using Pearson distance 
and Ward’s minimum variance method were used to evaluate sample 
quality and similarity. The generalized linear model likelihood ratio 
test implemented in the edgeR package was applied to determine
significant differentially expressed genes between groups. Benjamini-Hochberg
correction was applied to the P values for multiple testing
adjustment. Significant differentially expressed genes were selected 
using a false discovery rate cutoff of 5% with or without an absolute 
fold-change of >2 for heatmap generation. The Pearson distance and 
Ward’s minimum variance method were used to generate the cluster
dendrograms on the heatmaps. 
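
The expression filter described above (more than one read per million in at least two samples) can be sketched in a few lines. This is an illustration only: the function names and the toy count table are hypothetical, and library sizes here are plain column totals, whereas the study used edgeR in R with TMM-adjusted library sizes.

```python
def counts_per_million(counts):
    """Convert raw counts to counts per million (CPM).

    counts -- dict mapping gene name to a list of raw read counts,
              one entry per sample (all lists the same length).
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total reads assigned to genes in each sample.
    lib_sizes = [sum(c[i] for c in counts.values()) for i in range(n_samples)]
    return {
        gene: [1e6 * c[i] / lib_sizes[i] for i in range(n_samples)]
        for gene, c in counts.items()
    }


def filter_weak_genes(counts, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM strictly above min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return {
        gene: counts[gene]
        for gene, values in cpm.items()
        if sum(v > min_cpm for v in values) >= min_samples
    }


# Toy count table (hypothetical gene names and values).
toy_counts = {
    "GeneA": [500, 600, 550],
    "GeneB": [0, 1, 0],
    "GeneC": [999500, 999399, 999450],
}
kept = filter_weak_genes(toy_counts)
```

In this toy table each library contains exactly one million reads, so CPM equals the raw count; `GeneB` never exceeds one read per million and is dropped.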


EV miRNA sequencing and data analysis 

miRNA libraries compatible with Illumina were prepared using the 
QIAseq miRNA Library Kit (QIAGEN), per the manufacturer’s protocol.
In brief, 80 ng of total RNA was sequentially ligated to adapters. In the
first ligation step, an adenylated 3’ DNA adaptor was ligated to the 3’ 
end of the miRNA. In the second step of the ligation, an RNA adaptor 
was ligated to the 5’ end of the mature miRNAs in the sample. A cDNA
library was then synthesized from the mature miRNAs using a reverse 
transcription primer containing an integrated unique molecular bar- 
code. During reverse transcription, the reverse transcription primer 
hybridized to the 3’ adaptor and converted the dual 3’/5’ ligated miRNAs
to cDNAs, while adding a unique molecular barcode and a universal 
sample index to every miRNA molecule. Following bead cleanup, the 
libraries were enriched and unique sample indexes were added using 
16 cycles of PCR. A post-PCR bead cleanup was performed, and then the 
libraries were assessed for size distribution using the Tapestation (Agi- 
lent Technologies), quantified using the Qubit assay (Thermo Fisher 
Scientific), and then multiplexed, 33 samples per pool. The pooled 
library was quantified by quantitative PCR using the KAPA Library 
Quantification Kit (KAPA Biosystems), then sequenced, one pool per 
run, on the Illumina NextSeq 500 Sequencer using the 75-nt SR High- 
Output flow cell. 

Single-end reads in FASTQ format were initially checked for read 
quality using FastQC. The QIAseq miRNA Library Kit specific 3’ adaptor
sequence was trimmed from the 3’ ends of sequencing reads using
cutadapt. The trimmed reads were then aligned to the reference
genome, human GRCh38, using BWA. SAMtools flagstat was used to
check the mapping quality. The SAM files with mapped reads for each
sample were sorted by coordinate and outputted in BAM format using 
the Picard tool of SortSam (http://broadinstitute.github.io/picard/). 
The miRNA GFF annotation file was downloaded from miRBase, which 
serves as an input for featureCounts to estimate the number of reads
that are mapped to each mature miRNA. Weakly expressed miRNAs 
were filtered out if they did not have more than one read in at least two 
samples. The read counts for all samples were then normalized using 
the trimmed mean of M method implemented in the R Bioconductor 
package edgeR. Principal component analysis and hierarchical cluster
analysis using Pearson distance and Ward’s minimum variance method 
were used to evaluate sample quality and similarity. The generalized 
linear model likelihood ratio test implemented in the edgeR package
was applied to determine significant differentially expressed miRNAs
between groups. The Benjamini-Hochberg correction was applied to
the P values for multiple testing adjustment. Significant differentially 
expressed miRNAs were selected using a false discovery rate cutoff of 
5% for each comparison. The Pearson distance and Ward’s minimum 
variance method were used to generate the cluster dendrograms on 
the heatmaps. 
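
The Benjamini-Hochberg adjustment used for both the mRNA and miRNA analyses can be illustrated with a short sketch. The function name and the example P values are hypothetical; the study itself relied on the adjustment as provided through edgeR in R.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for four features.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]
```

With a false discovery rate cutoff of 5%, only the first feature passes in this example, even though three raw P values are below 0.05.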


Affymetrix gene expression microarray and analysis 

Total RNA was collected to assess global gene changes after TG 
neurons were transfected with miR-21, miR-34a, miR-324, or scram- 
ble miRNA. The array was analysed using the Affymetrix Clariom S 
mouse assay. The CEL files generated were processed through Tran- 
scriptome Analysis Console version 4.0 (Thermo Fisher Scientific), 
which normalized (and applied the log2 function to) array signals
using a robust multiarray averaging algorithm. Differential expression
between neurons transfected with different miRNAs was defined
as an absolute fold-change equal to or greater than
1.1 and a P value obtained from the moderated t-statistic from the
limma package that was less than 0.05. The gene-level differential
expression analysis was performed using Transcriptome Analysis 
Console version 4.0. Pathway analysis and functional annotation for 


upregulated and downregulated genes in the three comparisons were 
performed using enrichR (https://amp.pharm.mssm.edu/Enrichr/). 
Dysregulated genes from the pathways of interest were displayed 
using R software (version 3.5.1). 
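
The selection rule described above (absolute fold-change of at least 1.1 with P < 0.05) can be sketched as follows. The function names are hypothetical, and the signed fold-change convention (ratios below 1 reported as negative reciprocals) is an assumption about how the threshold applies to downregulated genes; the moderated t-statistic itself is not computed here.

```python
def signed_fold_change(log2_treated, log2_control):
    """Linear fold change from log2 signals; decreases are reported
    as negative reciprocals (for example, a halving becomes -2.0)."""
    ratio = 2.0 ** (log2_treated - log2_control)
    return ratio if ratio >= 1.0 else -1.0 / ratio


def is_differential(log2_treated, log2_control, p_value,
                    min_abs_fc=1.1, alpha=0.05):
    """Apply the absolute fold-change and P value thresholds together."""
    fc = signed_fold_change(log2_treated, log2_control)
    return abs(fc) >= min_abs_fc and p_value < alpha


# Hypothetical log2 signals: (miRNA-transfected, scramble control, P value).
examples = [
    (8.00, 7.00, 0.01),   # twofold increase, significant
    (7.00, 8.00, 0.01),   # twofold decrease, significant
    (7.05, 7.00, 0.01),   # about 1.04-fold, below the fold-change threshold
    (8.00, 7.00, 0.20),   # twofold increase, fails the P value threshold
]
calls = [is_differential(t, c, p) for t, c, p in examples]
```

A threshold of 1.1 is permissive, so the P value cutoff does most of the filtering in practice.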


Canonical pathway integrative analysis 

Canonical pathway activation and downregulation were predicted 
in QIAGEN Ingenuity pathway analysis. Ingenuity pathway analysis 
was used to identify the cascade of upstream and downstream 
regulators of the core gene set. Ingenuity pathway analysis uses a 
priori knowledge of expected interactions between transcriptional 
regulators and their target genes stored in Ingenuity Knowledge 
Base, a scientific literature-based database (https://www.g6g-softwaredirectory.com/bio/cross-omics/dbs-kbs/20018U-Ingenuity-Knowledge-Base.php).


miRNA target analysis 

Predicted targets were retrieved from the TargetMiner, TargetScan, 
and miRDB databases through the searchable database miRBase, 
while experimentally validated targets were retrieved from 
miRTarBase. 


Preparation of miRNA encapsulated liposomes 

A lipid mixture of 1,2-dipalmitoyl-3-dimethylammonium-propane 
(Avanti Polar Lipids), cholesterol (Sigma-Aldrich), and DSPE-PEG 2000 
(Avanti Polar Lipids) in a molar ratio of 42:48:10 was dissolved in 100% 
ethanol. The mixed lipids were added to 125 mmol/l sodium acetate 
buffer (pH 5.2) to yield a solution containing 35% ethanol. The resultant 
nanoparticles were extruded through a 0.08-µm membrane (Whatman) 
using a LIPEX Extruder (Northern Lipids) to form 120- to 140-nm 
nanoparticles. miRNA in 50 mmol/l sodium acetate (pH 5.2) and 35% 
ethanol was added to the nanoparticles at a 538:1 lipid:miRNA ratio and 
incubated at 37 °C for 30 min. Ethanol removal and buffer exchange 
of miRNA-containing nanoparticles were achieved by dialysis against 
PBS using a 13,000-kDa MWCO dialysis bag (Spectra/Por Dialysis Membrane) 
for 24 h at 4 °C, with the external medium exchanged after 1, 3, 
and 24 h. Finally, the formulation was filtered through a 0.2-µm sterile 
filter. Particle size and zeta potential were determined using a Zetasizer 
Nano ZS (Malvern). miRNA entrapment efficiency was determined by 
the Quant-iT RiboGreen RNA assay (Invitrogen). For labelling, 50 µl 
of 1 mg/ml 18:1 Liss Rhod PE 1,2-dioleoyl-sn-glycero-3-phosphoethanolamine-N-(lissamine 
rhodamine B sulfonyl) (ammonium salt) was 
added to the lipid mixture. 


Statistical analysis 

The unpaired, two-tailed t-test and one-way ANOVA with Tukey multiple 
comparisons were carried out to analyse in vitro data. For mouse studies, 
a two-way ANOVA was used to compare tumour volumes between 
control and treatment groups. For immunohistochemical analyses, a 
one-way ANOVA was used to compare control and treatment groups. 
Survival was analysed by the Kaplan-Meier method and compared 
using the log-rank test. All bar graphs were expressed as mean ± s.e.m. 
with individual values shown if n < 12. P values of less than 0.05 were 
considered to indicate nominal statistical significance. On the basis of 
the variance of xenograft growth in control mice, we used at least three 
mice per genotype to give 80% power to detect an effect size of 20% 
with a significance level of 0.05. For all mouse experiments, the number 
of independent mice used is listed in the figure legend. No statistical 
methods were used to predetermine sample size. The experiments 
were not randomized and investigators were not blinded to allocation 
during experiments and outcome assessment. 
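
For readers unfamiliar with the survival analysis mentioned above, here is a minimal sketch of the Kaplan-Meier product-limit estimator. It covers only the survival curve, not the log-rank comparison; the function name and the example follow-up data are hypothetical and do not come from the study.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.

    times:  follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        n_at_risk -= removed
    return curve


if __name__ == "__main__":
    # Hypothetical follow-up times (months) and event indicators.
    for t, s in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
        print(t, round(s, 3))
```

Censored subjects contribute to the risk set up to their last follow-up time but never lower the survival estimate themselves, which is why they are counted in `removed` but not in `deaths`.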


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Neuron RNA sequencing data from in vivo and in vitro experiments 
are available from the Gene Expression Omnibus (GEO) under acces- 
sion number GSE134220. mRNA array data are available on GEO under 
accession number GSE140189, and miRNA array data are available on 
GEO under accession number GSE140324. All other data are available 
in the article and source data, or from the corresponding author upon 
reasonable request. 
Extended Data Fig. 1 | High nerve density was associated with the presence 
of p53 mutation. a, Representative haematoxylin and eosin image of OCSCC 
samples from TCGA demonstrating low (top) and high (bottom) nerve 
densities; data independently replicated in 231 samples. Asterisks represent 
neural structures. b, Overall survival of patients with OCSCC with high (24 
neurofilaments per field) and low nerve densities. Two-sided log-rank test. 
c, Quantification of nerve density in TCGA OCSCC patient cohort (n=231). 
Bar graphs represent mean ± s.e.m. Unpaired two-tailed t-test. d, Serial in vivo 
analyses of PCI-13 cell engraftment and growth in BALB/c (nu/nu) mice (n=6 
per group). Tumour growth curves represent mean tumour volume ± s.e.m. 
Unpaired two-tailed t-test. e, Representative immunofluorescence of 
glossectomy specimens taken 4 weeks after orthotopic injection of isogenic 
p53WT or p53null PCI-13 cells; data independently replicated in 16 mice. 
f, Quantification of nerve area in p53WT and p53null OCSCC xenografts (n=8 mice 
per group). g, Quantification of neuritogenesis in DRG co-cultured with 
p53-isogenic PCI-13 cells or normal oral keratinocytes (n=6 biologically 
independent ganglia per cell line). h, Immunoblots demonstrating the 
knockdown of TP53 in human HN30 OCSCC cells; data replicated in two 
independent experiments. i, Representative immunofluorescence staining of 
neo-neurites (β3-tubulin+) in DRG co-cultured with HN30 (left) and 
HN30-shp53 (right) OCSCC cells. Data independently replicated in 13 samples. 
j-l, In vitro quantification of number (j), branching (k), and length (l) of 
neurofilaments protruding from ex vivo DRG co-cultured with HN30 (n=8 
ganglia) or HN30-shp53 (n=5 ganglia) OCSCC cells. Bar graphs represent 
mean ± s.e.m. One-way ANOVA with Tukey multiple comparison. 
Extended Data Fig. 2 | OCSCC-derived EVs and neuritogenesis. 
a, Representative transmission electron microscopy image of EVs from 
isogenic PCI-13, HN30, and HN31 OCSCC cells; data replicated in two 
independent experiments. b-d, Size distribution from nanoparticle tracking 
analysis of particles derived from p53WT (b) or p53null (c) PCI-13 cells or HN30 and 
HN31 (d) cells; data replicated in two independent experiments. e, Western blot 
of the EV marker CD63 in EVs from PCI-13 and HN30/HN31 cells; data replicated 
in two independent experiments. f, Western blot of p53 and controls 
(HSP70) in p53WT or p53null PCI-13 and HN31 cells and their corresponding 
cell-derived EVs; data replicated in three independent experiments. g, Confocal 
immunofluorescence images showing EVs (lipophilic DiI-labelled, red) in the 
cytoplasm of a neuron (labelled with β3-tubulin, green) 8 h after application of 
EVs derived from p53WT or p53null PCI-13 cells. Percentage represents the 
proportion of DiI+ β3-tubulin+ neurons out of all β3-tubulin+ neurons (n=6 
ganglia per condition). h, Immunoblots demonstrating the knockout (KO) of 
RAB27A and RAB27B in OCSCC cells edited with sgRNAs targeting RAB27A and 
RAB27B (HN31 clones 11 and 18, respectively) compared with HN31 controls; 
data replicated in two independent experiments. i, Nanoparticle tracking 
analysis of EV particle number in conditioned medium from HN31 clones 11 
(n=5 biologically independent samples) and 18 (n=7 biologically independent 
samples) compared with HN31 controls (n=6 biologically independent 
samples). Number of EVs was adjusted to cell number; bars represent 
mean ± s.e.m. Unpaired two-tailed t-test. j, k, In vitro quantification of 
branching (j) and neurofilament length (k) in freshly collected DRGs cultured 
with conditioned medium from HN31 RAB27A+/+RAB27B+/+ (n=8) and HN31 
RAB27A−/−RAB27B−/− (n=5) isogenic human OCSCC cells. Bar graphs and tumour 
growth curves represent mean tumour volume ± s.e.m. Unpaired two-tailed 
t-test. 


Extended Data Fig. 3 | p53-dependent miRNA in OCSCC. a, OCSCC RNA 
transfer to neurons via EVs. Representative confocal immunofluorescence 
image demonstrating PCI-13 cell-derived RNA labelled with SYTO RNASelect 
(green) in the perinuclear cytoplasm of a neuron (labelled with β3-tubulin, 
red). Images were captured 12 h after application of EVs derived from PCI-13 
cells labelled with SYTO RNASelect; data replicated in two independent 
experiments. b, EVs derived from PCI-13 cells contained mainly small RNA 
species. Bioanalyzer results showing presence of RNA in EVs from PCI-13 cells. 
Representative band of EV RNA by Agilent RNA Pico Chips; data independently 
replicated in ten experiments. c, An unsupervised hierarchical clustering heat 
map showing differentially expressed EV miRNAs between p53-isogenic PCI-13 
cells. p53WT, n=3 biologically independent samples; p53null and p53mut, n=14 
biologically independent samples. d, Heat map of differentially expressed 
miRNA, arranged by unsupervised hierarchical clustering, presenting the 
miRNA sequencing for EVs derived from isogenic PCI-13 cells expressing p53WT 
versus no p53 (p53null) or mutant p53 (p53C238F, p53G245D, and p53R273H). The 
Pearson distance and Ward’s minimum variance method were used for pairwise 
clustering (c, d). Red and green indicate increased and decreased expression 
levels, respectively (n=2 to 5 per group). e, Fold change in hsa-miR-141-5p and 
hsa-miR-34a-5p in EVs derived from p53WT PCI-13 cells (blue, n=3 biologically 
independent samples) compared with p53null or p53mut cells (red, n=14 
biologically independent samples). Results are log2 normalized. f, Real-time 
PCR quantification of miR-34a and miR-141 in ventral tongues from Trp53flox/flox 
and Krt5Cre Trp53flox/flox mice (n=7 per group). g, Real-time PCR quantification of 
CDK6 (miR-34a target) and ZEB1 (miR-141 target) in neurons treated with 
antagomiR-34 or antagomiR-141 compared with nonspecific 
antagomiR-treated controls (n=3 biologically independent samples per group). 

h, Quantitative validation of miR-34a and miR-141 overexpression after 
transfection with miR-34a and miR-141 mimics, respectively. TG neurons were 
transfected with miR-34a mimic, miR-141 mimic, or scramble miR, and 
overexpression of miR-34a and miR-141 was confirmed by real-time PCR (n=7 
biologically independent samples per group). i, Real-time PCR quantification 
of miR-34a in orthotopic tumour xenografts of HN30 OCSCC cells treated with 
shControl (blue) or shmiR-34a (purple). n=4 biologically independent samples 
per group. j, k, Western blot of NOTCH1 (confirmed miR-34a target) in OCSCC 
transfected with lentiviral miR-34a inhibitor or scramble miRNA inhibitor (j). 
Bar graph quantification of the blots demonstrates no impact of miR-34a 
inhibition on p53 expression and is normalized to the total amount of β-actin 
(n=4 biologically independent samples per group, j). Unpaired two-tailed 
t-test; bars and dot plots represent mean ± s.e.m. (e-i, k). 
Extended Data Fig. 4 | microRNAs modulate neuritogenesis. a, Screening of 
candidate neuritogenesis-associated miRNAs. Quantification of 
neuritogenesis 72 h after neuron-EV co-culture. Eight hours after transfection 
with 13 different antagomiRs, TG neurons were incubated with EVs derived 
from p53null PCI-13 cells (n=4 biologically independent samples per condition). 
One-way ANOVA with Tukey multiple comparison. b, Quantification of 
neuritogenesis in TG neurons 72 h after transfection with miR-21 mimic, 
miR-197 mimic, or miR-324 mimic or co-transfection with their combinations 
(n=3 biologically independent samples per condition). c, Representative 
fluorescence-bright-field overlay images demonstrating the lack of response of 
TG neurons exposed to EVs derived from p53null PCI-13 cells after 
co-transfection with antagomiR-21 and antagomiR-324 and the response of TG 
neurons after miR-21 and miR-324 mimic co-transfection. Data replicated 
across six independent samples. d, Quantitative validation of miR-21 and 
miR-324 overexpression in TG neurons incubated with liposomes containing 
miR-21, miR-324, and scramble miRNA (n=3 biologically independent samples per 
group). e, Representative fluorescence-bright-field overlay images of TG 
neurons exposed to liposomes containing miR-21, miR-324, and scramble 
miRNA or liposomes containing miR-21, miR-324, and miR-34a; data 
independently replicated in 20 wells. f, Quantification of neuritogenesis in TG 
neurons 72 h after neuron-liposome co-culture (n=5 biologically independent 
samples per condition). Unpaired two-tailed t-test; bars represent 
mean ± s.e.m. (a, b, d) or one-way ANOVA with Tukey multiple comparisons (f). 
Extended Data Fig. 5 | TP53 deficiency does not change parasympathetic 
nerve fibre densities in human OCSCC specimens. a, b, Representation of 
vesicular acetylcholine transporter (VAChT)+ nerve densities in both TP53WT 
and TP53mut OCSCC tissues; data independently replicated in 24 patient 
specimens (a). Quantification of cholinergic VAChT+ neural areas in 
TP53-sufficient (TP53WT, blue, n=12) and TP53-deficient (TP53mut, red, n=12) human 
OCSCC tissues. Each dot represents the mean for one patient (b). NF-H, 
neurofilament heavy. c, d, TP53 deficiency increases the sympathetic nerve 
fibre density in normal tongue tissue surrounding OCSCC in humans. 
Representative images showing TH+ adrenergic neural fibres in human 
normal tongue tissue surrounding OCSCC with TP53WT (left) or TP53mut (right) 
(TH, green; neurofilament light (NF-L), red; DAPI, blue). Data independently 
replicated in 24 patient specimens (c). Quantification of adrenergic TH+ areas 
in TP53-sufficient (TP53WT, blue, n=12) and TP53-deficient (TP53mut, red, n=12) 
human OCSCC samples. Each dot represents the mean for a patient (d). 

e, Correlation of TH and NF-L expression levels. Linear regression (r², n=24 
biologically independent samples). f-h, Representative images of TG neurons 
labelled with anti-TH antibody after incubation with EVs derived from p53WT or 
p53null PCI-13 cells; data independently replicated in 14 wells (f). Quantification 
of TH+ neurons (n=7 biologically independent samples per condition, g), and 
noradrenaline levels (n=4 biologically independent samples per condition, h). 
i-k, Co-culture of TG neurons with p53null EVs for 72 h induced TH coexpression 
in TRPV1+ but not IB4+ neurons; TH expression remained stable 72 h after 
washout of the EVs. n=4 biologically independent samples per condition (k). 

l, TH+ neural areas in PCI-13-p53null orthotopic tumours injected daily with no 
EVs or with EVs derived from p53WT or p53null PCI-13 cells for 3 weeks; data 
independently replicated in 15 mice. m, Co-culture of TG with liposomes 
containing miR-21 and miR-324 but not miR-34a increases catecholamine 
synthesis. Noradrenaline levels in neurons cultured with nano-liposomes 
containing miR-21 + miR-34a + miR-324 or miR-21 + miR-324 controls, quantified 
by enzyme-linked immunosorbent assay. n=3 biologically independent 
samples per condition. n, Heat map of differentially expressed genes in mouse 
TG neurons co-cultured with p53-isogenic EVs. Enriched Gene Ontology terms 
of the neurons were plotted at fold enrichment with the associated log P value 
(Fisher’s exact algorithm for functional gene set enrichment); n=3 biologically 
independent samples for p53WT and n=4 biologically independent samples for 
p53null. Mean ± s.e.m.; unpaired two-tailed t-test (b, d, g, h, k, m). 
Extended Data Fig. 6 | Target analysis for miR-21, miR-34a, and miR-324. 
a, Schematic of Ingenuity pathway analysis of pain signalling (silenced, green) 
and noradrenaline biosynthesis (activated, red). AADC, aromatic L-amino acid 
decarboxylase; DBH, dopamine β-hydroxylase; PNMT, phenylethanolamine 
N-methyltransferase. b, Total RNA was collected to assess global gene changes 
after TG neurons were transfected with miR-21 (red), miR-324 (blue), miR-34a 
(green), or scrambled miRNA. Data show fold change in expression of potential 
targets involved in neural identity determination (left), neural growth (middle), 
and neural function (right) between neurons transfected with different 
miRNAs (n=3 biologically independent samples per condition). P values 
obtained from the moderated t-statistic for the presented genes were <0.05. 
Forest plots displaying fold change and P values for genes on different neuron 
pathways that are significantly differentiated between TG neurons transfected 
with miR-21, miR-34a or miR-324 and scramble miRNA. Linear models and 
empirical Bayes methods were used for obtaining the statistics and assessing 
differential gene expression between two conditions. 


[Extended Data Fig. 7 image, panels a-i. Recoverable labels: murine TG neurons, TH, sham surgery, lingual denervation, days after inoculation, HN30 xenograft (control, shmiR-34a).] 

Extended Data Fig. 7 | Loss of p53 in OCSCC induces adrenergic switch 
proximally in TG neurons. a, Flow cytometry quantification of 
neurotrophin-3-positive (NT3+), TH+ neurons in freshly collected ipsilateral TG neurons 
3 weeks after orthotopic inoculation of p53null PCI-13 cells to the tongues of 
sympathectomized mice. Non-tumour-bearing, sympathectomized mice were 
used as controls (n=6). b, Representative immunohistochemical analysis for 
TH+ in orthotopic xenografts; data independently replicated in 16 mice. c, Flow 
cytometry quantification of NT3+TH+ neurons in ipsilateral TG neurons (n=6 
mice per condition). d, Representative images of TH+ TG neurons in mice 
without tumours (left) and 3 weeks after injection of p53WT (middle) or p53null 
(right) PCI-13 cells to the ipsilateral tongue; data independently replicated in 
nine mice. e, Flow cytometry quantification of NT3+ TH+ neurons in freshly 
collected ipsilateral TG 3 weeks after orthotopic inoculation of p53WT (middle) 
and p53null (right) PCI-13 cells to the tongue. Non-tumour-bearing mice were 
used as controls (left, n=12 per group). f, Serial in vivo analyses of tumour 
growth after engraftment of HN30 transfected with either control lentivirus 
(HN30-lenti) or shmiR34a (HN30-shmiR34a) into BALB/c (nu/nu) mice. Mice 
were randomized and underwent lingual denervation or sham surgery 1 week 
before cell injection (n=8 per group). Tumour growth curves represent mean 
tumour volume ± s.e.m.; unpaired two-tailed t-test. g-i, Neural density (g), TH+ 
area (h), and noradrenaline levels in vivo (i) in HN30-lenti and HN30-shmiR34a 
orthotopic xenografts with and without lingual denervation (n=5 biologically 
independent samples per condition). Bars indicate mean ± s.e.m.; unpaired 
two-tailed t-test. 


[Extended Data Fig. 8 image, panels a-i. Recoverable labels: murine TG neurons, log10 P value, pathway type and pathway association, NeuO-Bright (ipsilateral, contralateral), autofluorescence, carvedilol (p.o.) versus vehicle, days after inoculation, time (months), TH low versus TH high.] 

Extended Data Fig. 8 | Characterization of OCSCC-induced neural 
transcriptional program. a, Heat map of differentially expressed genes 
arranged by unsupervised hierarchical clustering in TG neurons 3 weeks after 
orthotopic injection of p53null or p53WT PCI-13 cells (n=3 biologically 
independent samples per condition) and enriched Gene Ontology terms 
plotted by fold enrichment with the associated log P value (right; Fisher’s exact 
algorithm for functional gene set enrichment). b, Flow cytometry 
quantification of NeuroFluor-positive (NeuO+), POU5F1+ (left), NeuO+KLF4+ 
(middle), and NeuO+ASCL1+ (right) neurons in ipsilateral TGs after orthotopic 
injection of p53null or p53WT PCI-13 cells; data independently replicated in six 
mice. c, Representative images in freshly collected TG neurons (red, NF-H+) 
from BALB/c (nu/nu) mice after orthotopic injection of either p53null or p53WT 
PCI-13 cells to the ipsilateral tongue, demonstrating lack of nuclear expression 
of the transcription factors SOX2, TBR2, and DCX and similar neurogenin2 
expression between groups. Data independently replicated in six mice; DAPI, 
blue. d, e, Representative necropsy photograph (TG delineated by dashed line 
in c) and flow cytometry quantification of NeuO+ neurons in freshly collected 
TG 3 weeks after tumour injection to the left side of the tongue; ipsilateral 
(c, right, black arrowhead) and contralateral (c, left, white arrowhead) ganglia 
were similar in size. None of the tumours crossed the midline of the tongue 
(n=6). f, g, Mice were treated daily with either β-adrenergic receptor blocker 
carvedilol or vehicle via oral gavage. On day 5, mice were orthotopically 
xenografted with human p53-deficient (p53null PCI-13 or HN30-shp53) cells to 
the tongue. Serial in vivo tumour volume measurement (n=12 per group except 
for p53null PCI-13 tumour-bearing mice with carvedilol treatment, n=13; f). 
h, Adrenergic inhibition decreases OCSCC proliferation in vivo. Carvedilol 
injections inhibited the proliferation of p53null PCI-13 cells orthotopically 
implanted into the tongue, as determined by Ki-67 expression (n=6 
biologically independent samples per condition). i, Kaplan-Meier curves 
showing the recurrence-free survival of patients with high (>2,000 µm² per 
field) and low (<2,000 µm² per field) TH+ adrenergic nerve densities. 
Mean ± s.e.m.; unpaired two-tailed t-test. 
Reporting Summary 


Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency 
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist. 


Statistics 


For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. 


n/a | Confirmed 


The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 


A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) 
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 


For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings 


For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 


Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection Data was collected using Microsoft Excel 2016 64-bit (Microsoft), Vevo 2100 ultrasound imaging system, iQ3 (Andor), the PerkinElmer 
Vectra platform, Axio Scan.Z1 (Zeiss), FACS Diva Version 8.0.1 (BD Biosciences), NanoSight LM10 Nanoparticle Characterization system, 
AMT Imaging System (Advanced Microscopy Techniques Corp.). 


Data analysis Data was analyzed using GraphPad Prism Version 7.03, SAS JMP Pro software version 12.1.0, the R language environment for statistical 
computing version 3.1.3, ImageJ 1.52 (NIH, USA), the FilamentTracer module in the Imaris software (Bitplane), the Simple Neurite Tracer 
plugin in the Fiji build of ImageJ (NIH), TissueFinder in PerkinElmer inForm software, Pannoramic Viewer (3DHISTECH), FACSDiva 8.0 
software (BD Biosciences), FlowJo 1.0.1, Kaluza software (Beckman Coulter), the package edgeR, QIAGEN Ingenuity Pathway Analysis. 
Western blots were cropped in Adobe Photoshop CC 2017 and Adobe Illustrator CC 2017. miRNA analysis: Cutadapt version 1.8.1 for 
trimming adapters from reads, BWA version 0.7.15 for mapping the short reads, SAMtools version 1.2 for manipulating bam/sam files, 
featureCounts from Subread version 1.5.2 for miRNA quantification. mRNA analysis: murine: FastQC version 0.11.7 for read quality check, 
TopHat2 version 2.0.14 for mapping, SAMtools version 1.2 for manipulating bam/sam files, HTSeq version 0.6.1 for gene quantification; 
human: FastQC version 0.11.8 for read quality check, TopHat2 version 2.1.1 for mapping, SAMtools version 1.2 for manipulating bam/sam 
files, HTSeq version 0.11.0 for gene quantification. 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 
Data 


Policy information about availability of data 
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


Source data for the figures and extended data figures are provided. The datasets generated during the current study are available from the corresponding authors 
upon reasonable request. 


Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


[x] Life sciences    [ ] Behavioural & social sciences    [ ] Ecological, evolutionary & environmental sciences 


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size Sample sizes from other experiments were estimated from similar experiments in former publications of the group. 


Sample sizes for clinical data: 

No statistical methods were used to predetermine sample sizes and samples were selected based upon the availability of data as outlined 
below. 

-TCGA cohorts of head and neck cancer (oral cancer) were assessed. 

-TP53 status and the expression levels of tyrosine hydroxylase, vesicular acetylcholine transporter, neurofilament-light, and neurofilament-heavy 
in glossectomy tissues from treatment-naive patients with tongue cancer treated at the University of Texas MD Anderson Cancer 
Center, Houston, Texas were analyzed (n = 24, 12 with wild-type TP53 and 12 with mutant TP53). 

-The expression levels of tyrosine hydroxylase within tumor areas in patients with oral cavity squamous cell carcinoma were evaluated and 
compared with their clinical characteristics and survival (n = 70). 

For human DRGs, experiments were prioritized based on the availability of the specimen. 


In vitro studies: 


In vivo studies: 

Animal experiments were conducted using between three and 13 mice per group, based on the results of pilot studies and previous studies 
that have given statistically significant results based on the variance of xenograft growth in control mice. These power calculations indicated 
use of at least 3 mice per genotype to give 80% power to detect an effect size of 20% with a significance level of 0.05. 


Data exclusions No data excluded. 


Replication All attempts at replication were successful. Experiments were performed at least two times and/or with sufficient cells/animals per group to 
demonstrate statistical significance. Number of replicates of each experiment is indicated in the corresponding figure legend. 
Data was replicated using: 
1. Human specimen (cancer and neurons) 
2. Orthotopic xenograft models (HN30, HN31, PCI13, p53- and miR34- isogenic cell lines) 
3. Transgenic (Trp53) model 


We have designed (CRISPR/Cas9) 2 independent clones (#11 and #18) of RAB27A-/-;RAB27B-/- knockout cells. 


We have designed p53 function studies using genetic approach with 5 isogenic PCI-13 (exogenous p53) cell lines and 2 isogenic HN30 
(endogenous p53) OCSCC cell lines. 


Randomization Animals were randomly allocated for surgical denervation or carvedilol/6OHDA treatment prior to surgery or treatment initiation, 
respectively. For exosome injection studies and orthotopic models with no intervention (other than tumor engraftment) animals were 
randomly allocated prior to cell inoculation. 


Blinding The researchers were blinded during the measurement of the tumor size and body weight of mice except in the experiment shown in Fig. 3i-k. 


The researchers were blinded during outcome assessment. 


Reporting for specific materials, systems and methods 


Sample sizes were determined based on the results of pilot studies, and previous similar studies that have given statistically significant results. 
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems (n/a | Involved in the study): 
Antibodies 
Eukaryotic cell lines 
Palaeontology 
Animals and other organisms 
Human research participants 
Clinical data 

Methods (n/a | Involved in the study): 
ChIP-seq 
Flow cytometry 
MRI-based neuroimaging 
Antibodies used Rat anti-human/mouse Oct3/4-APC (Clone:240408, R&D systems, cat#:IC1759A, 1:400). 
Mouse anti-human CD63 (unconjugated, Clone TS63, Abcam, cat#:ab59479, 1:1000, Lot:GR3186539-1). 
Rabbit anti-TRPV1 (unconjugated, Clone VR1, Alomone Labs cat#:ACC-030; 1:500, Lot:ACCO30AG1040). 
Rabbit anti-doublecortin (unconjugated, polyclonal, Abcam, cat#:ab18723, 1:1000, Lot#: GR3224908-1). 
Rabbit anti-Rab27B (unconjugated, polyclonal, Abcam, cat#:ab103418, 1:200, Lot #: GR3238038-3). 
Mouse anti-Rab27A (unconjugated, monoclonal, Abcam, cat#:ab55667, 1:200, Lot #: GR3198959-1). 
Rabbit anti-TH (unconjugated, polyclonal, Bioss, cat#:bs-OO16R-A488, 1:50, Lot:AEO92200). 
Mouse anti-TP53 (unconjugated, Clone DO-1, Cell Signaling, cat#:18032, 1:1000). 
Rabbit anti-Notch1 (unconjugated, Clone D1E11, Cell Signaling, cat#:3608, 1:1000). 
Mouse anti-β-actin (unconjugated, Clone AC-15, Sigma, cat#:A1978, 1:4000). 
Mouse anti neuron-specific β3-tubulin-APC (Clone TUJ-1, R&D systems, cat#:IC1195A, 1:100, Lot: ABFKO117091). 
Rabbit anti-Oct4 (unconjugated, polyclonal, Abcam, cat#:ab18976, 1:100, Lot:GR3202710-3). 
Rabbit anti-β3-tubulin (polyclonal, unconjugated, Abcam, cat#:ab18207, 1:2000, Lot#: GR3221401-3 and GR3196636-1). 
Chicken anti-neurofilament heavy (polyclonal, unconjugated, Abcam, cat#:ab4680, 1:1000, Lot:GR3241438-4). 
Rabbit anti-MASH1/ASCL1 (polyclonal, unconjugated, Abcam, cat#:ab74065; 1:250, Lot:GR3178505-1). 
Rabbit anti-TH (polyclonal, unconjugated, EMD Millipore, cat#: AB152, 1:400, Lot:2971004, 3031639 and 3072361). 
Rabbit anti-neurogenin2 (unconjugated, polyclonal, EMD Millipore, cat#:AB5682, 1:400, Lot:3066197). 
Rabbit anti-neurofilament heavy polypeptide (unconjugated, polyclonal, Abcam, cat#:ab8135, 1:1000, Lot:GR3224043-9). 
Chicken anti-68kDa NF/NF-L (unconjugated, polyclonal, Abcam, cat#:ab24520, 1:700, Lot:GR3206616-5). 
Mouse anti-SOX2 (unconjugated, Clone 245610, R&D systems, cat#: IC2018V-100UG, 1:500, Lot:1502977). 
Rabbit anti-neurofilament L (unconjugated, polyclonal, EMD Millipore, cat#:AB9568, 1:700, Lot:2943223). 
Goat anti-VAChT (unconjugated, polyclonal, EMD Millipore, cat#:ABN100, 1:1000, Lot: 2899777). 
Chicken anti-Tbr2 (unconjugated, polyclonal, EMD Millipore, cat#:AB15894, 1:500, Lot:3071578). 
Mouse anti-pancytokeratin (unconjugated, Polyclonal PAN-CK, ThermoFisher, cat#:MA5-13203, 1:50). 
Rabbit anti-KLF4 (unconjugated, polyclonal, Novus, cat#:NBP2-24749, 1:100, Lot:05232589B-07). 


Validation Well-validated human and mouse antibodies were purchased from established commercial vendors, including Abcam, Sigma, 
EMD Millipore, Novus, Bioss, R&D Systems, Cell Signaling, and Life Technologies. Unless otherwise noted, antibodies were used 
at manufacturer- and primary literature-validated concentrations for the relevant assays, as detailed below: 
Rat anti-human/mouse Oct3/4-APC (Clone:240408, R&D systems, cat#:IC1759A, 1:400). Mutnal MB et al. Murine 
cytomegalovirus infection of neural stem cells alters neurogenesis in the developing brain. PLoS ONE, 6(1):e16211 (2011). 
Validated data in FC by the provider. 
Mouse anti-human CD63 (unconjugated, Clone TS63, Abcam, cat#:ab59479, 1:1000, Lot:GR3186539-1), Matsuzaki K et al. 
MiR-21-5p in urinary extracellular vesicles is a novel biomarker of urothelial carcinoma. Oncotarget 8:24668-24678 (2017). 
Zonneveld MI et al. Recovery of extracellular vesicles from human breast milk is influenced by sample collection and vesicle 
isolation procedures. J Extracell Vesicles 3:N/A (2014). Validated data in IHC, WB, IF by the provider. 
Rabbit anti-TRPV1 (unconjugated, Clone VR1, Alomone Labs cat#:ACC-030; 1:500, Lot:ACCO30AG1040), Devesa et al. αCGRP is 
essential for algesic exocytotic mobilization of TRPV1 channels in peptidergic nociceptors. Proc Natl Acad Sci USA. 
111(51):18345-50 (2014). Knockout-validated by provider. 
Rabbit anti-doublecortin (unconjugated, polyclonal, Abcam, cat#:ab18723, 1:1000, Lot#: GR3224908-1), Kovalchuk Y et al. In 
vivo odourant response properties of migrating adult-born neurons in the mouse olfactory bulb. Nat Commun 6:6349 (2015). 
Validated data in IHC, WB, FC, IF by the provider. 
Rabbit anti-Rab27B (unconjugated, polyclonal, Abcam, cat#:ab103418, 1:200, Lot #: GR3238038-3). Jiang Y et al. MicroRNA-599 
suppresses glioma progression by targeting RAB27B. Oncol Lett 16:1243-1252 (2018). Validated data in WB by the provider. 
Mouse anti-Rab27A (unconjugated, monoclonal, Abcam, cat#:ab55667, 1:200, Lot #: GR3198959-1). Uchino K et al. Therapeutic 
Effects of MicroRNA-582-5p and -3p on the Inhibition of Bladder Cancer Progression. Mol Ther 21:610-9 (2013). Knockout- 
validated data in WB by the provider. 
Rabbit anti-TH (unconjugated, polyclonal, Bioss, cat#:bs-OO16R-A488, 1:50, Lot:AEO92200). Validated data in IHC, WB, FC by the 
provider. Additional supportive validation exists in Antibodypedia. 
Mouse anti-TP53 (unconjugated, Clone DO-1, Cell Signaling, cat#:18032, 1:1000). Validated data in WB by the provider. 
Rabbit anti-Notch1 (unconjugated, Clone D1E11, Cell Signaling, cat#:3608, 1:1000). Man J et al. Hypoxic induction of 
vasorin regulates Notch1 turnover to maintain glioma stem-like cells. Cell Stem Cell 22(1):104-118 (2018). Validated data in WB by 
the provider. 
Mouse anti-β-actin (unconjugated, Clone AC-15, Sigma, cat#:A1978, 1:4000). Validated data in WB by the provider. 
Mouse anti neuron-specific β3-tubulin-APC (Clone TUJ-1, R&D systems, cat#:IC1195A, 1:100, Lot: ABFKO117091). Noristani HN et 
al. Spinal cord injury induces astroglial conversion towards neuronal lineage. Mol Neurodegener. 11(1):68 (2016). Validated data 
in FC by the provider. 

Rabbit anti-Oct4 (unconjugated, polyclonal, Abcam, cat#:ab18976, 1:100, Lot:GR3202710-3), Hassiotou F et al. Expression of the 
Pluripotency Transcription Factor OCT4 in the Normal and Aberrant Mammary Gland. Front Oncol 3:79 (2013). Lee SJ et al. Adult 
stem cells from the hyaluronic acid-rich node and duct system differentiate into neuronal cells and repair brain injury. Stem Cells 
Dev 23:2831-40 (2014). Validated data in IHC, WB by the provider. 

Rabbit anti-β3-tubulin (polyclonal, unconjugated, Abcam, cat#:ab18207, 1:2000, Lot#: GR3221401-3 and GR3196636-1), 
Delgado-Esteban M et al. APC/C-Cdh1 coordinates neurogenesis and cortical size during development. Nat Commun 4:2879 
(2013). Validated data in IHC, WB, FC, IF by the provider. 

Chicken anti-neurofilament heavy (polyclonal, unconjugated, Abcam, cat#:ab4680, 1:1000, Lot:GR3241438-4). Wirt SE et al. G1 
arrest and differentiation can occur independently of Rb family function. J Cell Biol 191:809-25 (2010). Validated data in IHC, WB, 
IF by the provider. 
Rabbit anti-MASH1/ASCL1 (polyclonal, unconjugated, Abcam, cat#:ab74065; 1:250, Lot:GR3178505-1). Validated data in IHC, 
WB, IF by the provider. 
Rabbit anti-TH (polyclonal, unconjugated, EMD Millipore, cat#: AB152, 1:400, Lot:2971004, 3031639 and 3072361), Magnon C 
et al. Autonomic Nerve Development Contributes to Prostate Cancer Progression. Science 341(6142):1236361 (2013). Validated 
data in IHC, ELISA, WB, IF by the provider. 

Rabbit anti-neurogenin2 (unconjugated, polyclonal, EMD Millipore, cat#:AB5682, 1:400, Lot:3066197), Validated data in IHC, 
ELISA, WB, IF by the provider. 

Rabbit anti-neurofilament heavy polypeptide (unconjugated, polyclonal, Abcam, cat#:ab8135, 1:1000, Lot:GR3224043-9); Woo 
SH et al. Piezo2 is required for Merkel-cell mechanotransduction. Nature 509:622-6 (2014). Validated data in IHC, WB, IF by the 
provider. 
Chicken anti-68kDa NF/NF-L (unconjugated, polyclonal, Abcam, cat#:ab24520, 1:700, Lot:GR3206616-5), Validated data in IHC, 
WB, IF by the provider. 
Mouse anti-SOX2 (unconjugated, Clone 245610, R&D systems, cat#: IC2018V-100UG, 1:500, Lot:1502977). Najm, FJ et al. 
Transcription factor-mediated reprogramming of fibroblasts to expandable, myelinogenic oligodendrocyte progenitor cells. Nat 
Biotechnol, 31(5):426-33 (2013). Validated data in WB, FC, IF by the provider. 

Rabbit anti-neurofilament L (unconjugated, polyclonal, EMD Millipore, cat#:AB9568, 1:700, Lot:2943223), Validated data in IHC 
by the provider. 
Goat anti-VAChT (unconjugated, polyclonal, EMD Millipore, cat#:ABN100, 1:1000, Lot: 2899777), Validated data in IHC by the 
provider. 

Chicken anti-Tbr2 (unconjugated, polyclonal, EMD Millipore, cat#:AB15894, 1:500, Lot:3071578); He Y et al. ALK5-dependent 
TGF-β signaling is a major determinant of late-stage adult neurogenesis. Nature neuroscience, 17:943-52 (2014). Validated data 
in WB, IHC by the provider. 

Mouse anti-pancytokeratin (unconjugated, Polyclonal PAN-CK, ThermoFisher, cat#: MA5-13203, 1:50), Validated data in WB, IHC, 
IF by the provider. Additional supportive validation exists in Antibodypedia. 

Rabbit anti-KLF4 (unconjugated, polyclonal, Novus, cat#:NBP2-24749, 1:100, Lot:05232589B-07). Validated data in WB, IHC, FC by 
the provider. Additional supportive validation exists in Antibodypedia. 
Eukaryotic cell lines 


Policy information about cell lines 


Cell line source(s) PCI-13 cells were obtained from the laboratory of Dr. Jennifer Grandis (University of Pittsburgh, Pittsburgh, PA). HN30 and 
HN31 cells were obtained from the laboratory of Dr. John Ensley (Wayne State University, Detroit, MI) and from Dr. Barbara 
Frederick (University of Colorado Health Sciences Center-Colorado), respectively. 


Authentication All human cell lines were authenticated upon arrival by STR profiling using 14 short tandem repeat (STR) loci including the 
gender determining locus, Amelogenin. 


Mycoplasma contamination All cell lines were tested for mycoplasma contamination periodically, including immediately upon receipt via the MycoAlert 
Mycoplasma Testing Kit (Lonza). Results were always negative for mycoplasma contamination. 


Commonly misidentified lines (See ICLAC register) No commonly misidentified cell lines were used in this study. 


Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals Male or female B6.129P2-Trp53tm1Brn/J, and BALB/c nu/nu (B6.Cg-Foxn1nu+/-) mice at the age of 6 to 8 weeks were obtained 
from the Jackson Laboratory. Krt5-Cre was obtained from Dr. Carlos Caulin. 


Wild animals No wild animals were used. 
Field-collected samples No field-collected samples were used. 
Ethics oversight Study protocols were approved by the University of Texas MD Anderson Cancer Center Institutional Animal Care and Use 
Committee (IACUC). 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


Human research participants 


Policy information about studies involving human research participants 


Population characteristics Historical specimens from patients who underwent glossectomy and neck dissection and had histologically confirmed and clinically 
local or locoregional oral cavity cancer at the University of Texas MD Anderson Cancer Center were used. 

For human DRG experiments, written informed consent for participation, including use of tissue samples, was obtained from 
each patient prior to inclusion. The protocol was reviewed and approved by The University of Texas MD Anderson Cancer Center 
Institutional Review Board, and all experiments conformed to relevant guidelines and regulations. Briefly, each donor was 
undergoing surgical treatment that necessitated ligation of spinal nerve roots to facilitate tumor resection or spinal 
reconstruction. Spinal roots were ligated proximal to the DRG and sharply cut both proximal and distal to the DRG, 
and excised DRG were transferred immediately into cold (~4°C), sterile balanced salt solution containing nutrients. DRG were 
transported to the laboratory on ice in a sterile, sealed 50-mL centrifuge tube. Upon arrival at the laboratory, each ganglion was 
carefully dissected from the surrounding connective tissues and sectioned into several ~1- to 2-mm pieces. DRG were digested 
in 2 mL of a mixed enzyme solution: 0.1% trypsin (Sigma-Aldrich, T9201), 0.1% collagenase (Sigma-Aldrich, C1764; w/v, final 
concentration), and 0.01% DNase (Sigma-Aldrich, D5025) diluted in DMEM/F-12. The pieces of tissue were transferred to a 37°C 
rotator to shake at a speed of 124-128 revolutions/min. Every 20 minutes, tissue fragments were allowed to settle, and the 
supernatant/dissociated cells were collected and transferred to DMEM/F-12 with enzyme inhibitor. Supernatant was replaced 
with 2 mL of fresh digestion solution. The tissue was returned to the 37°C rotator, and this process was repeated until tissue 
fragments were well digested. Dissociated cells were centrifuged at 180 rpm for 5 minutes, supernatant was removed, and the 
cells were gently resuspended in culture medium (DMEM/F-12 supplemented with 10% EV-free serum and 2 mM glutamine). 
Cells were plated onto laminin-coated µ-Slide 8 Well (ibidi) and cultured at 37°C with 5% CO2 for 24-72 hours prior to 
undergoing transfection or labeling. 

Recruitment Human samples were obtained from surgically resected OCSCC patients treated at the Department of Head and Neck Surgery at 
the University of Texas MD Anderson Cancer Center, Houston, TX, USA. All patients gave their informed consent for the use of their 
resected specimens. 

Ethics oversight Study protocols were approved by the University of Texas MD Anderson Cancer Center Institutional Review Board. 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


Flow Cytometry 


Plots 

Confirm that: 
The axis labels state the marker and fluorochrome used (e.g. CD4-FITC). 
The axis scales are clearly visible. Include numbers along axes only for bottom left plot of group (a ‘group’ is an analysis of identical markers). 
All plots are contour plots with outliers or pseudocolor plots. 
A numerical value for number of cells or percentage (with statistics) is provided. 

Methodology 

Sample preparation Primary sensory neurons were isolated from dissected TG as previously described (31). After ganglia dissection, tissue was 
enzymatically digested with papain (40 U/mL, EMD Millipore) for 20 minutes at 37°C followed by 20 minutes of digestion with 
collagenase II (4 mg/mL)/dispase II (4.6 mg/mL) solution. Using Percoll gradient (12.5% and 28% Percoll in complete L-15 medium 
[L-15 with 5% fetal calf serum, penicillin/streptomycin, HEPES]), we separated the myelin and nerve debris from trigeminal 
neurons. All cell sorting experiments (See Neuron Sequencing) were carried out using a FACSAria Cell Sorter (BD Biosciences), 
and all flow cytometric analyses were carried out using an LSR II flow cytometer running FACSDiva 8.0 software (all BD 
Biosciences). 

Samples were run in F-12 serum-free media, and cell doublets, debris, and dead cells were excluded by forward scatter (FSC), side scatter (SSC), and 
NeuO (NeuroFluor NeuO, Stemcell Technologies, Ex/Em: 468/557 nm) profiles. Data were analyzed by Kaluza (Beckman Coulter) 
software. To assess adrenergic differentiation and transdifferentiation by flow cytometry, cell suspensions were stained on ice 
with cell surface markers and transcription factors (see Antibodies and staining reagents). Briefly, following cell fixation and 
permeabilization, cells were stained with the primary antibodies according to the manufacturer’s instructions (BD Biosciences). 
After exclusion of dead cells and non-neuronal cells (by NTIII labeling), TH, OCT3/4, SOX2, KLF4, MASH1, doublecortin, TBR1 and 
neurogenin2 fluorescence was measured by flow cytometry. 

Instrument BD FACSAria Cell Sorter and BD LSR II (BD Biosciences) were used in this study. 

Software BD FACSDiva Software Version 8.0.1 was used to collect the data. 
FlowJo Version 1.0.1 was used to analyze the data. 

Cell population abundance Cell number subjected to flow cytometry was limited due to availability of the cells from trigeminal ganglia. 
Gating strategy FSC/SSC were used to discern single cells from doublets/multiple cells. Samples without fluorescent staining were used to 
establish boundaries between negative and positive cells. 


Tick this box to confirm that a figure exemplifying the gating strategy is provided in the Supplementary Information. 
Xist represents a paradigm for the function of long non-coding RNA in epigenetic 
regulation, although how it mediates X-chromosome inactivation (XCI) remains 
largely unexplained. Several proteins that bind to Xist RNA have recently been 
identified, including the transcriptional repressor SPEN, the loss of which has been 
associated with deficient XCI at multiple loci. Here we show in mice that SPEN is a 
key orchestrator of XCI in vivo and we elucidate its mechanism of action. We show 
that SPEN is essential for initiating gene silencing on the X chromosome in 
preimplantation mouse embryos and in embryonic stem cells. SPEN is dispensable for 
maintenance of XCI in neural progenitors, although it significantly decreases the 
expression of genes that escape XCI. We show that SPEN is immediately recruited to 
the X chromosome upon the upregulation of Xist, and is targeted to enhancers and 
promoters of active genes. SPEN rapidly disengages from chromatin upon gene 
silencing, suggesting that active transcription is required to tether SPEN to 
chromatin. We define the SPOC domain as a major effector of the gene-silencing 
function of SPEN, and show that tethering SPOC to Xist RNA is sufficient to mediate 
gene silencing. We identify the protein partners of SPOC, including NCoR/SMRT, the 
m6A RNA methylation machinery, the NuRD complex, RNA polymerase II and factors 
involved in the regulation of transcription initiation and elongation. We propose that 
SPEN acts as a molecular integrator for the initiation of XCI, bridging Xist RNA with the 
transcription machinery, as well as with nucleosome remodellers and histone 
deacetylases, at active enhancers and promoters. 


To assess the importance of SPEN during the initiation of XCI, we used 
an auxin-inducible degron (AID) that enables controlled and acute 
depletion of the endogenous SPEN protein. We used our previously 
described female hybrid (Mus musculus castaneus × C57BL/6) TX1072 
mouse embryonic stem cells (ES cells), in which a doxycycline (DOX)- 
inducible promoter upstream of the endogenous Xist locus enables 
conditional Xist RNA expression and XCI (Fig. 1a). In ES cells expressing 
the Oryza sativa TIR1 (OsTIR1) E3 ligase, we generated a homozygous 
knock-in that expressed the AID fused to a HaloTag at the C terminus 
of endogenous SPEN, in order to ensure auxin-dependent SPEN deple- 
tion (Extended Data Fig. 1a). Efficient degradation of SPEN occurred 
within 1 h of auxin treatment (Fig. 1b, Extended Data Fig. 1b, Supplementary 
Fig. 1) whereas the removal of auxin led to rapid recovery of 
SPEN (Fig. 1b), demonstrating potent AID-dependent modulation of 
SPEN levels. 

To evaluate the immediate consequences of the loss of SPEN on the 
initiation of XCI, we acutely depleted SPEN for 4 h before inducing Xist 
expression for 24 h and performing RNA sequencing. Loss of SPEN had 
no effect on the formation of Xist RNA clouds (Extended Data Fig. 1c, e), 
confirming that SPEN is dispensable for Xist localization. However, 
gene silencing was almost completely abolished along the entire X 
chromosome in the absence of SPEN (Fig. 1c, d, Supplementary Table 1), 
whereas auxin had no effect on XCI in wild-type cells (Extended Data 
Fig. 1d). Clustering analysis highlighted three groups of genes that 
differed in their silencing defects upon the loss of SPEN (Fig. 1e). Most 
X-linked genes (80% of 382) were found to be entirely dependent on 
SPEN for silencing, whereas only a small subset (6%) showed unaltered 
silencing in the absence of SPEN. This notable defect in XCI was confirmed 
by pyrosequencing (Fig. 1f) and nascent RNA fluorescence in situ 
hybridization (FISH) (Extended Data Fig. 1e). 
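
As an illustration only, the allelic-ratio readout discussed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the function names, the example read counts and the definition of the silencing defect as a simple difference in allelic ratio are all hypothetical.

```python
# Hypothetical sketch: names, counts and the "defect" definition are
# assumptions made for illustration, not taken from the study.

def allelic_ratio(xi_reads: int, xa_reads: int) -> float:
    """Fraction of allele-specific reads that come from the Xist-expressing X."""
    total = xi_reads + xa_reads
    if total == 0:
        raise ValueError("gene has no allele-specific reads")
    return xi_reads / total


def silencing_defect(ratio_dox: float, ratio_dox_auxin: float) -> float:
    """Difference in allelic ratio between DOX + auxin and DOX alone."""
    return ratio_dox_auxin - ratio_dox


# Made-up (Xi-allele, Xa-allele) read counts for one gene in three conditions.
counts = {"0h": (50, 50), "dox": (10, 90), "dox_auxin": (45, 55)}
ratios = {cond: allelic_ratio(xi, xa) for cond, (xi, xa) in counts.items()}
defect = silencing_defect(ratios["dox"], ratios["dox_auxin"])
```

With these made-up counts the gene is biallelic before induction (ratio 0.5), silenced after DOX (0.1) and largely unsilenced when SPEN is depleted (0.45), so a gene of this kind would fall in the SPEN-dependent group.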

We next assessed the requirement for SPEN in XCI in vivo during 
mouse early embryogenesis, using allele-specific RNA sequencing 
in embryonic day (E)3.5 Spen-knockout female embryos harbouring 
hybrid X chromosomes (Fig. 1g, Extended Data Fig. 1f, g). At this stage 
in wild-type embryos, imprinted XCI has taken place and only the 
paternal X chromosome is inactivated (Fig. 1h, Extended Data Fig. 1h). 
Fig. 1 | SPEN mediates gene silencing across the entire X chromosome 
in vitro and in vivo. a, Schematic of SPEN-degron Xist-inducible mouse ES cells. 
Xa, active X chromosome; Xi, inactive X chromosome. b, Western blot showing 
auxin-induced degradation of endogenous HaloTagged SPEN. This experiment 
was repeated at least twice with similar results. c, d, Heat map (c) and violin 
plots (d) showing X-chromosomal transcript allelic ratios after 0 h, 24 h DOX or 
24 h DOX + auxin treatment in SPEN-degron mouse ES cells (n = 434 genes, 
two-sided Student’s t-test). e, Box plot representation of gene-silencing defect upon 
SPEN loss in three groups of genes differing by their level of dependence on 
SPEN for Xist-mediated silencing. The pie chart shows the relative number of 
genes in each group. f, Pyrosequencing assay of seven X-linked transcripts in 


In maternal-zygotic Spen knockouts, imprinted XCI is severely hindered 
although paternal Xist is expressed. Both maternal and paternal X chromosomes 
are expressed equally, phenocopying Xist-knockout E3.5 
embryos (Fig. 1h, Extended Data Fig. 1g, h, Supplementary Table 2). 
A maternal-only Spen knockout has no effect on imprinted XCI (Fig. 1h), 
suggesting that the zygotic pool of SPEN is necessary and sufficient for 
this process. Therefore, the early gene-silencing mechanism(s) involved 
in imprinted and random XCI are dependent on SPEN. 

We next assessed precisely when SPEN is recruited during XCI. 
HaloTag labelling of SPEN combined with Xist RNA FISH revealed that 
SPEN associates with Xist RNA rapidly upon Xist coating and throughout 
XCI (Fig. 2a). To capture early Xist-SPEN dynamics during the short 
time window in which Xist becomes upregulated, we followed both 
Xist and SPEN in living cells. We tagged endogenous SPEN with GFP in a 
background in which Xist RNA is visualized via a BglG-mCherry fusion 
protein binding to Bgl stem-loops inserted within Xist (Extended 
Data Fig. 2a, b). Live-cell imaging revealed that SPEN colocalizes with 
Xist from the very onset of Xist upregulation (Extended Data Fig. 2c, d, 
Supplementary Video 1). Therefore, SPEN can initiate gene silencing 
immediately upon Xist coating. 

We also found that SPEN robustly accumulated on the inactive X 
chromosome after differentiation into neural progenitor cells (NPCs, 
Fig. 2b), in which XCI is epigenetically maintained. The depletion of 
SPEN for up to two days in independent NPC clones (Fig. 2c) did not 
lead to reactivation of fully silenced genes (Fig. 2d, Supplementary 
Table 3); however, we observed moderate but significant upregulation 
of genes escaping XCI (Fig. 2e, f), which suggests that SPEN buffers the 
overexpression of X-linked escapee genes in female cells. 

Chromosome conformation capture has revealed that, in differentiated 
cells, the inactive X chromosome is folded into megadomains 
mouse ES cells after 0 h, 24 h DOX or 24 h DOX + auxin treatment. Data in c-f are 
averages of two independent clones; in f, individual data points are shown. 

g, The mouse crossbreeding scheme for the Spen-knockout experiment. KO, 
knockout; WT, wild type; mat., maternal; zyg., zygotic; Xm, maternal X 
chromosome; Xp, paternal X chromosome. h, X-chromosomal transcript allelic 
ratio distribution (n = 256 genes) in wild-type (n = 2), maternal-only (M) Spen-knockout 
(n = 3), maternal-zygotic (Z) Spen-knockout (n = 5), and Xist-knockout 
E3.5 embryos (n = 30 single cells, two-sided Wilcoxon rank-sum test. For * refer 
to ref. 10). In d, e, h, horizontal lines denote the median, box limits correspond to 
the upper and lower quartiles. 


and is globally depleted of topologically associating domains except in 
regions that contain clusters of escapee genes. Xist RNA has been found 
to have a role in the conformation of the inactive X chromosome. 
To assess whether SPEN is involved, we performed allele-specific 
Hi-C in NPCs after 48 h of SPEN depletion. No notable conformational 
changes were observed on the inactive X chromosome (Extended Data 
Fig. 2e-g); we therefore conclude that the structural effects mediated 
by Xist RNA in differentiated cells occur independently of SPEN. 

In summary, our data suggest that SPEN exerts its role by actively 
promoting gene silencing during the earliest stages of XCI. However, 
it has no major role in stabilizing the transcriptionally inactive state 
of the inactive X chromosome, or in ensuring the maintenance of its 
conformation. 

We next sought to identify which parts of SPEN ensure its function 
during XCI. SPEN is a very large protein (around 400 kDa) that contains 
four RNA recognition motifs (RRMs), a nuclear receptor interaction 
domain (RID) and a SPEN paralogue/orthologue C-terminal (SPOC) 
domain (Fig. 3a). We overexpressed a series of SPEN complementary 
DNA truncations, stably targeted into the Rosa26 locus in the SPEN- 
degron mouse ES cell line (Extended Data Fig. 3a, b, Fig. 3a). We then 
induced Xist expression for 24 h and assessed which SPEN fragments 
could rescue XCI-initiation function in the context of auxin-mediated 
depletion of endogenous SPEN. We found that the RRM1 domain and the 
RID are dispensable for SPEN accumulation on the inactive X chromo- 
some, as well as for X-linked gene silencing (Fig. 3b, c). By contrast, a 
SPEN truncation lacking the RRM2-4 domains failed to accumulate on 
the inactive X chromosome and failed to rescue XCI (Fig. 3b, c). SPEN 
recruitment to the inactive X chromosome is therefore mediated by the 
RRM2-4 domains and is necessary for gene silencing. This is consistent 
with studies showing that RRM2-4 directly bind the A-repeat of Xist 
Fig. 2 | SPEN localizes to the X chromosome immediately upon Xist 
upregulation and throughout the stages of XCI, but is dispensable for the 
maintenance of X-linked gene silencing. a, b, Images from combined HaloTag 
labelling of SPEN (green) and FISH for Xist RNA (red) in mouse ES cells during a 
time course of Xist induction (a) and in NPCs (b). Scale bars, 5 µm. c, Schematic 
of the SPEN-degron experiment in NPCs. d, Cumulative distribution of 
transcript allelic ratios across the X chromosome (n = 387 genes) after SPEN 


RNA in vitro, a region of Xist that is necessary for gene silencing. 
Conversely, a truncation of the SPOC domain enabled efficient SPEN 
accumulation on the inactive X chromosome, but failed to rescue XCI 
(Fig. 3b, c). To validate this observation, we performed homozygous 
deletion of the SPOC domain at the endogenous Spen locus in mouse 
ES cells (Extended Data Fig. 3c). Deletion of the SPOC domain had no 
effect either on SPEN recruitment to the inactive X chromosome or 
on Xist RNA clouds (Extended Data Fig. 3d-f), but resulted in strongly 
deficient XCI, albeit milder than that in SPEN-depleted cells (Extended 
Data Fig. 3g-j). Collectively, these results demonstrate that the 
SPOC domain is essential for XCI. However, other uncharacterized 
regions of SPEN contribute—albeit to a lesser extent—to ensure its full 
silencing potential. 

To test whether the SPOC domain alone could mediate X-linked gene 
silencing, we used SPEN-degron ES cells to introduce an array of Bgl 
stem-loops at the Xist locus (identical to the live-imaging strategy). 
In this background, we generated several independent ES cell lines 
expressing a BglG-GFP-SPOC protein fusion (or BglG-GFP as a control) 
targeted into Rosa26. These proteins would become tethered to 
Xist-Bgl stem-loop RNA via BglG (Fig. 3d). Notably, upon induction of 
Xist RNA in the absence of endogenous SPEN, tethering of BglG-GFP-SPOC 
(but not of BglG-GFP alone) resulted in substantial gene silencing 
across the X chromosome, with over half of the genes being silenced by 
more than 50% (Fig. 3e, f). SPOC-specific rescue was confirmed using 
pyrosequencing (Fig. 3g). Consistent with previous studies, our 
results reveal SPOC as a key domain of SPEN that enables gene silencing 
once recruited to the X chromosome by Xist RNA. 

The SPOC domain of SPEN was originally identified as an interactor 
of the NCoR and SMRT corepressors in human cells. Given that 
NCoR and SMRT interact with and activate HDAC3, it was proposed 
that SPEN triggers XCI via HDAC3, the activity of which is important 
for Xist-mediated silencing. However, XCI is more markedly affected 
depletion in NPCs. e, f, Cumulative distribution (e) and violin plot 
representation (f) of the transcript allelic ratio of escapees after SPEN 
depletion in NPCs (n = 65, two-sided Wilcoxon signed-rank test. NS, not 
significant. Horizontal lines denote the median, box limits correspond to 
upper and lower quartiles). Data in d-f are the average of two independent NPC 
clones. The experiments in a, b were repeated at least twice with similar results. 


upon the loss of SPEN and SPOC than upon the loss of HDAC3 (Extended 
Data Fig. 3j, k). These observations suggest that a model involving 
HDAC3 only partially explains the function of SPEN, and that SPOC 
must exert its key role in gene silencing also through other, HDAC3- 
independent pathways. To identify such pathways, we characterized 
the protein interactome of the SPOC domain by performing GFP 
pull-downs from mouse ES cells that stably expressed BglG-GFP-SPOC 
(or BglG-GFP as a control, Fig. 3h, Supplementary Table 4), followed 
by mass spectrometry analysis. 

We identified NCoR and SMRT as expected, but we also found 
HDAC3 (Fig. 3h, Extended Data Fig. 3l), which further supports the 
proposed model for the function of SPEN in XCI. Notably, we identified 
the m6A methyltransferase complex and the m6A reader YTHDC1 
(Fig. 3h, Extended Data Fig. 3l), which have been proposed to play a 
role in XCI. One of these factors, WTAP, co-purified with Xist RNA 
in an A-repeat-dependent manner, although, contrary to the case of 
SPEN, a direct interaction between WTAP and Xist A-repeat has not been 
reported. Our results therefore suggest that SPOC may participate in 
the recruitment of m6A machinery to Xist RNA. We also identified the 
NuRD complex (a potent repressor that displaces RNA polymerase II 
(RNAPII) from transcription start sites through chromatin remodelling) 
and RNAPII, together with factors that are involved in the regulation 
of transcription initiation and elongation (Fig. 3h, Extended Data 
Fig. 3l). Together these findings show that, through its SPOC domain, 
SPEN bridges Xist to multiple factors that are involved in transcription 
and chromatin regulation, and together they mediate efficient gene 
silencing. Given that SPOC immunoprecipitation was performed in the 
absence of Xist induction, the identified interactions are not mediated 
by Xist RNA. 

We also investigated where SPEN binds to the X chromosome during 
XCI, and whether it has distinct binding sites or whether it associates 
with chromatin diffusely across the entire chromosome, as anticipated 
Fig. 3 | The SPOC domain of SPEN mediates gene silencing and interacts with 
multiple molecular pathways. a, Spen cDNA fragments used for the rescue 
experiment. b, Heat map representation of four X-linked transcript allelic 
ratios (obtained by pyrosequencing) in control, 24 h DOX- and 24 h DOX + 36 h 
auxin-treated SPEN-degron mouse ES cells overexpressing each cDNA 
construct. Data represent averages of two to three independent clones. 
c, Immunofluorescence detection of Flag-tagged SPEN truncations (green) 
and H2AK119ub1 (red), a marker of the inactive X chromosome, in SPEN-degron 
mouse ES cells treated with DOX and auxin. Scale bar, 5 μm. d, Schematic 
showing the tethering of BglG-SPOC to Xist. e, Distribution of gene-repression 
scores observed across the X chromosome upon the depletion of endogenous 
SPEN and the tethering of BglG-GFP (green) or BglG-GFP-SPOC (orange) to 
Xist. f, Bar graphs showing the fraction of X-linked genes within four windows of 
repression score. g, Transcript allelic ratio (obtained by pyrosequencing) for 
four X-linked genes upon the depletion of endogenous SPEN and the tethering 
of BglG-GFP or BglG-GFP-SPOC to Xist (*P < 0.01, two-sided Student’s t-test). 
h, Western blot showing co-immunoprecipitated proteins in BglG-GFP and 
BglG-GFP-SPOC immunoprecipitation experiments. One per cent of the input 
was loaded (0.1% for RNAPII), and 10% of the pull-down. The experiments in 
c, h were repeated at least twice with similar results. The data in e–g are the 
average of four independent clones. 


from our imaging results. We performed allele-specific, cross-linked 
CUT&RUN experiments on SPEN during a time course of Xist induction 
(0 h, 4 h, 8 h, 24 h DOX, or 8 h DOX + auxin as a negative control). 

We found that there are few binding sites for SPEN across the genome 
of uninduced ES cells (Extended Data Fig. 4a). Conversely, hundreds 
of SPEN-binding sites appeared specifically on the X chromosome as 
early as 4 h after Xist induction (Fig. 4a, Extended Data Fig. 4a). This is 
consistent with imaging data (Extended Data Fig. 2). We note that SPEN 
accumulation is seen across the gene body of Xist (Fig. 4b), suggesting 
that SPEN binds Xist RNA while it is transcribed. In sharp contrast to the 
Xist locus, SPEN shows focal binding on the rest of the genome, with 
peaks falling almost exclusively on promoters and enhancers (Fig. 4c, 
d, Extended Data Figs. 4b, 5a–g). 

After Xist induction, recruitment of SPEN to the inactive X chromosome 
reaches a maximum at 4 h (Fig. 4a, Extended Data Fig. 4c), showing 
the highest enrichment within regions that were coated earliest by Xist 
(entry sites, Fig. 4e). SPEN accumulation thus follows the spatial dynamics 
of Xist spreading. Among promoter targets on the X chromosome, 
SPEN preferentially binds those of actively expressed genes (Fig. 4f, 
Extended Data Fig. 4d), which suggests that the ability of SPEN to target 
chromatin depends on transcriptional activity. Consistently, genes that 
are classified as fully dependent on SPEN for silencing (Fig. 1e), which 
show a greater degree of SPEN binding at their promoters within 4 h of 
Xist coating than less-dependent genes (Extended Data Fig. 4e), also 
show initially higher transcription levels (Fig. 4g). 

Furthermore, within 4 h of Xist induction, SPEN binding is greater at 
the promoters of efficiently silenced genes than at the promoters of 
less-efficiently silenced genes (Fig. 4h). Similarly, upon Xist coating, 
efficiently deacetylated enhancers show a higher enrichment of SPEN 
than less-efficiently deacetylated enhancers (Fig. 4i). Finally, genes that 
are subject to very little silencing, or those that completely escape XCI 
in our Xist-inducible system, show a significantly lower SPEN signal at 
their promoters (Extended Data Figs. 4f, g, 5h–n). This pattern of SPEN 
recruitment at discrete sites to the X chromosome that is undergoing 
XCI indicates that transcriptional silencing is caused by the binding of 
SPEN to active promoters and enhancers. 

To understand how SPEN might function at enhancers and promoters, 
we integrated CUT&RUN profiles with publicly available data from 
chromatin immunoprecipitation followed by sequencing (ChIP-seq) 
experiments for transcription and chromatin-associated factors 
identified in our mass spectrometry analysis. We included HDAC3, 
RNAPII and two members of the NuRD complex (MBD3 and CHD4). 
SPEN binding strongly overlaps with HDAC3 at enhancers but not at 
promoters (Extended Data Fig. 4h). Our recent findings revealed that 
HDAC3 is pre-bound predominantly at enhancers on the X chromosome. 
Therefore, Xist-mediated recruitment of SPEN to enhancers 
may activate HDAC3. Conversely, a strong overlap with SPEN binding 
is observed for the NuRD complex specifically at promoters but 
not at enhancers (Extended Data Fig. 4h). Furthermore, SPEN peaks 
extensively overlap with RNAPII phosphorylated on serine 5, which is 
associated with transcription initiation (Extended Data Fig. 4h). This 
analysis suggests that SPEN may operate at enhancers and promoters 
through distinct pathways to promote gene silencing. 

Notably, the binding of SPEN to chromatin decreases across the whole 
X chromosome after 24 h of Xist induction (Fig. 4a, Extended Data 
Fig. 4c). Clustering of CUT&RUN profiles at SPEN-bound promoters 
(Extended Data Fig. 4i, Supplementary Table 5) revealed distinct groups 
of promoters, grouped on the basis of how efficiently SPEN was lost 
within 24 h of XCI (Fig. 4j). In the ‘strong SPEN loss’ group, binding was 
maximal by 4 h but decreased after 8 h, and even more markedly after 24 h 
(Fig. 4j). Conversely, the ‘mild SPEN loss’ group showed maximal and 
persistent SPEN binding at 4 h and 8 h of Xist induction, respectively, 
with only a mild reduction of SPEN binding by 24 h (Fig. 4j). Finally, a 
third group, comprising fewer promoters, showed both mild SPEN 
enrichment at 4 h and low loss at 24 h (Fig. 4j). The group that lost SPEN 
most efficiently also showed the most pronounced gene silencing by 
24 h when compared with the groups that significantly retained SPEN 
(Fig. 4k, Extended Data Fig. 4j). Altogether this analysis suggests that, 
once recruited to the X chromosome by Xist RNA, SPEN associates 
with enhancers and promoters in a transcription-dependent manner. 
This recruitment leads to gene silencing, after which the favourable 
transcriptional context for SPEN binding is lost, and SPEN binding 
to chromatin decreases. Despite loss of the chromatin-bound SPEN 
Fig. 4 | SPEN is recruited by Xist to active gene promoters and enhancers, 
where it silences transcription and subsequently disengages from 
chromatin. a, SPEN allele-specific accumulation (obtained from CUT&RUN 
experiments) on peaks at autosomes (grey, n = 948) and on the X chromosome 
(green, n = 635) after 0 h, 4 h, 8 h and 24 h of Xist induction in mouse ES cells. 
Shown are average allelic ratios (shading is the interquartile range) of all peaks. 
b, UCSC Genome Browser allele-specific track showing SPEN binding around 
Xist. c, Annotation of SPEN peaks on the X chromosome. d, UCSC Genome 
Browser allele-specific track showing SPEN binding around Eda2r, an X-linked 
gene. In b, d, blue denotes Cast-Xa; red denotes B6-Xi; tracks are scaled 
identically. e, Box plot showing SPEN enrichment at 4 h in peaks outside or 
within Xist entry sites. f, Violin plot showing gene expression (reads per 
kilobase per million reads, RPKM) of genes accumulating SPEN (n = 289) or not 


fraction, persistent Xist RNA expression and coating ensure that SPEN 
remains strongly accumulated around the inactive X chromosome 
(Fig. 2a, b). 

Our study demonstrates that SPEN is a crucial factor that collaborates 
with Xist RNA to initiate gene silencing across the X chromosome, both 
during XCI in vitro and imprinted XCI in vivo. SPEN becomes dispensable 
for maintaining gene silencing after XCI has been established, but 
partially represses escapees, which suggests that Xist may have a silencing 
role even in somatic cells. Although SPEN coats the X chromosome 
immediately upon Xist induction, it contacts chromatin only at active 
promoters and enhancers, which serve as substrates for SPEN-mediated 
gene silencing. SPEN association with chromatin is favoured by active 
transcription, as SPEN disengages from chromatin when X-linked genes 
become silenced. We identify the SPOC domain of SPEN as a potent 
transcriptional repressor, which is crucial for SPEN-dependent XCI. 
On the basis of our mass spectrometry analysis, we propose that the 
SPOC domain is key for bridging Xist with other factors implicated 
in XCI, such as HDAC3, which we find to be present at most X-linked 
enhancers to which SPEN is recruited. In particular, the interaction of 
the SPOC domain with the NuRD complex and the transcription machinery 
points to a role for SPEN in direct transcriptional repression. We also 
identify SPOC as an interactor of the m⁶A methyltransferase complex, 
which has a role in Xist RNA methylation, a modification that is important 
for Xist-dependent silencing. Methylation of Xist is mediated 
by RBM15, which interacts with the m⁶A machinery directly through 
ZC3H13, the most highly enriched m⁶A machinery factor identified 
in our mass spectrometry experiments. Because RBM15 also carries a 
SPOC domain, our study raises the possibility that the interaction with 
the RNA methylation machinery is not restricted solely to the SPOC 
domain of SPEN, but may instead be a feature that is shared across 
SPOC-containing proteins. 


accumulating SPEN (n = 2,325) at their promoters. g, Violin plot showing gene 
expression levels (RPKM in control conditions) of genes grouped on the basis 
of their level of dependence on SPEN for gene silencing (see Fig. 1e). h, i, Box 
plots showing SPEN enrichment after 4 h of Xist induction within peaks at 
promoters grouped on the basis of how efficiently their respective genes are 
silenced (h) or at enhancers grouped on the basis of how efficiently they are 
deacetylated during XCI (i). j, k, Box plots showing normalized SPEN 
enrichment at promoters (j) and gene silencing (transcript allelic ratio) during 
XCI (k) within 3 groups of X-linked genes showing different dynamics of SPEN 
accumulation and loss (n = 86 strong loss, n = 92 mild loss, n = 39 low SPEN). In 
e–i, the two-sided Wilcoxon rank-sum test was used; in e–k, horizontal lines 
denote the median, box limits correspond to upper and lower quartiles. 


SPEN binds other non-coding RNAs, including SRA, which is involved 
in steroid-receptor regulation. Furthermore, another SRA-binding protein, 
SLIRP, has been shown to bind promoters in an SRA-dependent 
manner; this raises the possibility that, similarly to Xist, SRA could 
guide SPEN to target gene regulatory elements. 

In conclusion, our study suggests that RNA-mediated recruitment 
of SPEN and other SPOC-containing proteins—which are found across 
fungi, plants and animals—may be a widespread means by which to 
acutely repress transcription by co-ordinately engaging several lay- 
ers of epigenetic and transcriptional control. We propose that SPEN 
bridges Xist to the transcription machinery, histone deacetylases and 
chromosome remodelling factors to ensure robust and efficient XCI 
(Extended Data Fig. 4k). 
Methods 


Data reporting and statistical analysis 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized and the investigators were not 
blinded to allocation during experiments and outcome assessment. 
All statistical tests, resulting P values and observation numbers are 
indicated in the figure panels or in the figure legends. 


Data visualization 

All heat maps, violin plots, box plots, density plots, bar graphs and pie 
charts were generated using ggplot2. Unless stated otherwise, box 
plots always show the median as the centre line, box limits correspond 
to upper and lower quartiles, and whiskers extend to 1.5× the interquartile 
range. 
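
The box-plot convention described above can be illustrated with a short sketch. This is not the ggplot2 code used for the figures; it is a minimal Python illustration, and the function name `box_plot_stats`, the example values and the choice of quantile definition (`method="inclusive"`) are assumptions made here for illustration only.

```python
from statistics import quantiles


def box_plot_stats(values, whisker=1.5):
    """Summarize values the way a Tukey-style box plot draws them.

    Assumption: quartiles use linear interpolation over the data range
    (statistics.quantiles with method="inclusive"); plotting libraries
    may use a slightly different quantile definition.
    """
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    lower_whisker = min(v for v in data if v >= low_fence)
    upper_whisker = max(v for v in data if v <= high_fence)
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Hypothetical values, for illustration only.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

In this sketch, observations beyond 1.5× the interquartile range from the box are reported separately as outliers rather than being covered by the whiskers.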


Plasmid construction 
The plasmids to target OsTIR1 at the TIGRE locus (Addgene plasmid 
92141) and the TIGRE-specific guide-RNA-encoding plasmid (Addgene 
plasmid 92144) were provided by E. Nora. The additional TIGRE-targeting 
plasmids BglG-mCherry-T2A-OsTir1 (pFD51) and rtTa-VP16-T2A-OsTir1 
(pFD68) were cloned using PCR amplification of corresponding 
gene cassettes followed by traditional cloning into the 92141 backbone. 
Targeting constructs (pFD19 and pFD449) to tag endogenous SPEN 
at its C terminus with AID-HaloTag and AID-GFP, respectively, were 
generated as follows: 500-bp homology arms (flanking both sides of, 
but excluding the stop codon of Spen) were amplified from mouse 
genomic DNA by PCR. One-step Gibson cloning (New England Biolabs) 
was subsequently used to simultaneously surround the digested AID 
insert (carrying a puromycin-resistance gene under the control of the 
PGK promoter) in frame with the homology arms and clone the insert 
into the pBR322 vector. Synonymous mutations in the PAM/SEED target 
sequence (located on the 5′ homology arm) were then introduced using 
the QuikChange II XL site-directed mutagenesis kit (Agilent) to prevent 
Cas9-mediated cutting of the targeting vector upon transfection and 
of the AID-tagged allele(s) upon integration. The targeting construct 
(pFD90) to replace the endogenous SPOC domain of SPEN by GFP was 
generated using the same strategy. For guide RNA cloning, the pX459 
plasmid (a gift from F. Zhang, Addgene 62988) encoding Streptococcus 
pyogenes Cas9 was digested with BbsI immediately downstream of the 
U6 promoter, and annealed DNA duplexes corresponding to the target 
guide RNA sequences were ligated. 


Cell culture 
Mouse XX ES cells (TX1072) were grown on 0.1% gelatin-coated flasks in 
8% CO₂, 37 °C incubators. For all experiments, cells were cultured in 2i + 
LIF and batch-tested fetal calf serum ES cell medium: DMEM (Sigma), 
15% FBS (Gibco), 0.1 mM β-mercaptoethanol, 1,000 U ml⁻¹ leukaemia 
inhibitory factor (LIF, Chemicon), CHIR99021 (3 μM), PD0325901 
(1 μM). 

NPC differentiations and subcloning were performed as previously 
described. NPCs were grown in N2B27 medium supplemented with 
EGF and FGF (10 ng ml⁻¹ each), on 0.1% gelatin-coated flasks. 


Cell transfection and clone isolation 

All transgenic insertions were performed using the 4D nucleofector 
system from Lonza. For each nucleofection, five million cells were 
electroporated with 2.5 μg each of non-linearized targeting vectors 
and guide RNA-Cas9 encoding plasmids (MidiPreps). Nucleofected 
cells were then serially diluted and plated on 10-cm dishes. Forty-eight 
hours later, antibiotic selection was performed (puromycin, 0.4 μg ml⁻¹; 
hygromycin, 250 μg ml⁻¹; blasticidin, 5 μg ml⁻¹), except for transfection 
steps involving flippase-mediated removal of resistance cassettes, 
during which no selection was applied. One week after the initial plat- 
ing, 80 to 96 single colonies were picked from dishes showing ideal 


clonal density and seeded in 96-well plates. These cells were subse- 
quently split into one high-confluency plate used for PCR genotyping, 
and one low-confluency plate from which desired clones were further 
expanded until T25 density was reached. At this stage, some cells were 
kept to reconfirm the correct genotype by PCR, while the remaining 
cells were frozen. 


Cell treatments 

Xist expression in TX1072 mouse ES cells was induced upon administration 
of doxycycline (1 μg ml⁻¹). Auxin-mediated depletion of 
target proteins was achieved by supplementing culture media with 
auxin (Sigma) at the recommended concentration of 500 μM. Auxin-containing 
medium was renewed every 24 h. For auxin wash-out, auxin-containing 
medium was removed, cells were rinsed once with PBS, and 
exposed to auxin-free medium. 


Protein extraction and western blotting 

Cells were trypsinized, washed once in medium and once in PBS and 
then pellets were immediately frozen at −80 °C. Pellets were then 
resuspended in RIPA buffer (50 mM Tris-HCl pH 8.0–8.5, 150 mM NaCl, 
1% Triton X-100, 0.5% sodium deoxycholate, 0.1% SDS) containing protease 
inhibitors (Roche), incubated for 30 min on ice and sonicated 
with a Bioruptor (three 10-s pulses). Lysates were then centrifuged 
for 20 min at 4 °C, and supernatants were kept. Protein concentration 
was determined using the Bradford (BioRad) assay. Samples were then 
boiled at 95 °C for 10 min in LDS buffer (Thermo) containing 200 mM 
DTT. For all western blots except those aimed at detecting SPEN, 4-12% 
Bis-Tris gels were used. For the detection of SPEN, a high-molecular-weight 
protein (>400 kDa), 3–8% tris-acetate polyacrylamide gels were 
used. Transfer was performed on a 0.45-μm nitrocellulose membrane 
using a wet-transfer system, at 350–400 mA for 2 h at 4 °C. 


RNA extraction, reverse transcription, pyrosequencing and RNA sequencing 

RNA extraction was performed using the RNeasy kit and on-column 
DNase digestion (Qiagen). Reverse transcription was performed on 
1 μg total RNA using SuperScript III (Life Technologies). To quantify 
allelic skewing, cDNA was amplified using biotinylated primers and 
subsequently sequenced using Q24 Pyromark (Qiagen). Only samples 
showing an RNA integrity number greater than 9 were used to 
prepare RNA sequencing (RNA-seq) libraries (TruSeq). Paired-end 
100-nt sequencing was performed on a HiSeq2500 or NovaSeq6000 
(Illumina). 


RNA FISH 

Cells were dissociated using Trypsin (Invitrogen) for ES cells or Accutase 
(Invitrogen) for NPCs, washed twice in medium, and allowed to attach 
on poly-L-lysine (Sigma)-coated coverslips for 10 min. Cells were fixed 
with 3% paraformaldehyde in PBS for 10 min at room temperature, 
washed in PBS three times, and permeabilized with ice-cold permea- 
bilization buffer (PBS, 0.5% Triton X-100, 2 mM vanadyl-ribonucleo- 
side complex) for 5 min on ice. Coverslips were stored in 70% ethanol 
at -20 °C. Samples were dehydrated in 4 baths of increasing ethanol 
concentration (80%, 95%, 100% twice) and air-dried quickly. Probes 
were prepared from minipreps of intron-spanning bacterial artificial 
chromosomes (BACs) (clone RP24-157H12 for Huwe1, RP23-260115 for 
Atrx) or plasmid (p510 for Xist). Probes were labelled by nick translation 
(Abbott) using dUTP labelled with spectrum green (Abbott) for Huwe1, 
spectrum red (Abbott) for Atrx, and Cy5 (Merck) for Xist. Labelled BAC 
probes were co-precipitated with Cot-1 DNA repeats in the presence 
of ethanol and salt, resuspended in formamide, denatured at 75 °C for 
10 min, and competed at 37 °C for 1 h. Probes were then co-hybridized 
in FISH hybridization buffer (50% formamide, 20% dextran sulfate, 2× 
SSC, 1 μg μl⁻¹ BSA, 10 mM vanadyl-ribonucleoside) at 37 °C overnight. 
The next day, hybridized coverslips were washed three times for 5 min 
with 50% formamide in 2× SSC at 42 °C, and three times for 5 min with 
2× SSC. DAPI (0.2 mg ml⁻¹) was added to the penultimate wash and 
coverslips were mounted with Vectashield (Vectorlabs). 


HaloTag labelling 

HaloTag labelling of the SPEN-Halo fusion protein was performed in 
live TX1072 ES cells and NPCs. Cells were labelled with HaloTag-ligand-conjugated 
Janelia Fluor (JF646-HaloTag or JF549-HaloTag, a gift 
from L. Lavis) at a final concentration of 250 nM in culture medium. 
Labelling was performed for 1 h at 37 °C, cells were then washed 
4 times with generous volumes of PBS, and incubated with unlabelled 
medium for 15 min before proceeding with downstream experiments. 
For NPC labelling, cells were washed with unlabelled medium and not 
PBS, because NPCs detach when exposed to PBS. Auxin and/or doxy- 
cycline were kept in the labelling medium when necessary. 


HaloTag labelling followed by RNA FISH 

For co-detection of SPEN-Halo and Xist RNA, cells were labelled with 
JF549 as indicated above, and directly processed for fixation and per- 
meabilization as detailed in the section ‘RNA FISH’. Importantly, after 
permeabilization, coverslips were directly washed twice with PBS, twice 
with 2X SSC and immediately processed for FISH. 


Mouse breeding, embryo collection and single-embryo RNA-seq 
Timed natural matings were used for all experiments. Noon of the day 
when the vaginal plugs of mated females were identified was scored 
as E0.5. For Spen matings a conditional allele was used. For oocyte 
deletions the published Rosa26:Zp3-Cre allele was used. F1 hybrid 
Spen’’ males were obtained by crossing Spen** CAST/EiJ females with 
Spen’’ C57BL/6J males. For Spen maternally deleted embryos, Spen” 
Zp3-Cre’” C57BL/6J females were crossed with Spen’” F1 hybrid males. 
For Spen control embryos, Spen”"« Zp3-Cre”’ C57BL/6J females were 
crossed with Spen’” F1 hybrid males. The care and use of animals in 
this study was performed in accordance with the recommendations 
of the European Community (2010/63/UE). All experimental proto- 
cols were approved by the ethics committee of Institut Curie CEEA- 
1C118 under the number APAFIS#8812-2017020611033784v2 given by 
national authority in compliance with the international guidelines. 
Single-embryo RNA-seq was performed as previously described. In 
brief, E3.5 embryos were collected and morphologically assessed to 
ensure that only viable samples were collected. The zona pellucida 
was removed by treatment with acidified Tyrode’s solution. Single 
embryos were picked into individual tubes and cDNA was prepared and 
amplified as previously described. Illumina libraries were prepared 
as published in ref. **. Paired-end 100-nt sequencing was performed 
with HiSeq2500 (Illumina). 


Live-cell imaging and analysis 

Cells were seeded on fibronectin-coated 35-mm glass-bottom dishes 
(Ibidi) 24 h before imaging. Doxycycline was added 1 h before image 
acquisition. Cells were imaged on the DeltaVision OMX microscope in 
widefield mode (GE Healthcare) using a 1.4 numerical aperture 100× 
oil immersion objective. The temperature was controlled at 37 °C and 
CO₂ at 8% during acquisition. 

Images were acquired as z-stacks of 40 slices with 400-nm steps 
every 10 min for at least 4 h. Movies were deconvolved using Huygens 
deconvolution with the following parameters: Iteration 4; S/N5, 10; 
quality threshold 0.1; and widefield mode 0.7 was used for background 
estimation. Two channels were registered using TetraSpeck microspheres 
0.1 μm (Invitrogen) and UnwarpJ (Fiji plug-in). For segmentation, 
z-projected deconvolved registered images were used and pixels 
were classified as cloud or nuclei using Ilastik. Touching nuclei were 
sometimes manually separated. Cut-offs on resulting probability maps 
were set to 0.7. We next performed connected component analysis to 
obtain integer-labelled images in which each integer label corresponds 


to a unique nucleus. In the tailor-made Fiji plug-in the inputs are the 
raw max z-projected time-lapse images of the two channels and the 
integer labelled time-lapse image of the nuclei. The probability maps 
of the clouds give the region of interest in the time-lapse sequence in 
which total intensity is calculated. In the plug-in, clouds are associated 
with their corresponding nuclei; they are then linked via Kalman filter 
tracker over time. These unique links constitute track IDs and contain 
information about the intensity and area measurements for each cell. 
For each tracked cell, the first time point when a cloud is detected in 
one channel (Xist or SPEN) is labelled as reference time point 1. 
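
The thresholding and connected-component step described above can be sketched as follows. This is not the Ilastik or Fiji code used in the study; it is a minimal pure-Python illustration. The function name `label_components`, the use of 4-connectivity and the treatment of the 0.7 cut-off as inclusive are assumptions made here, and the toy probability map is invented for the example.

```python
from collections import deque


def label_components(prob_map, cutoff=0.7):
    """Threshold a 2D probability map and give each connected region
    of above-cutoff pixels its own integer label (0 = background).

    Assumptions: pixels with probability >= cutoff are foreground,
    and neighbours are defined by 4-connectivity.
    """
    rows, cols = len(prob_map), len(prob_map[0])
    labels = [[0] * cols for _ in range(rows)]
    n_labels = 0
    for r in range(rows):
        for c in range(cols):
            if prob_map[r][c] < cutoff or labels[r][c]:
                continue
            # New unlabelled foreground pixel: flood-fill its region.
            n_labels += 1
            labels[r][c] = n_labels
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and prob_map[ny][nx] >= cutoff):
                        labels[ny][nx] = n_labels
                        queue.append((ny, nx))
    return labels, n_labels


# Toy probability map with two separate above-cutoff regions.
toy_map = [
    [0.90, 0.80, 0.10, 0.00],
    [0.20, 0.75, 0.10, 0.90],
    [0.00, 0.10, 0.30, 0.95],
]
labelled, count = label_components(toy_map)
print(count)
for row in labelled:
    print(row)
```

Each integer in the returned image identifies one region, which is the form of input the text describes for associating clouds with nuclei before tracking.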


Hi-C 

Hi-C was performed as previously described, except that ligated DNA 
size selection was omitted, and dA-tailing was performed before biotin 
pull-down. In brief, each Hi-C experiment was performed on 10 million 
cells (NPCs) per sample. Cells were digested with DpnII at 37 °C 
overnight. DNA ends were filled with biotin-14-dATP at 23 °C for 4 h. 
DNA was then ligated with T4 DNA ligase at 16 °C overnight. Binding 
proteins were removed by treating ligated DNA with proteinase K at 
65 °C overnight. Purified proximally ligated molecules were fragmented 
to obtain an average fragment size of 200 bp. After DNA end repair, 
dA-tailing and biotin enrichment, DNA molecules were ligated to Illumina 
TruSeq sequencing adapters at room temperature for 2 h. Final 
library PCR amplification was carried out following the Illumina TruSeq 
Nano DNA Sample Prep Kit manual. Paired-end 100-nt sequencing was 
performed on a HiSeq4000 (Illumina). 


Genetic engineering strategy for Xist-Bgl stem-loop tagging 
and SPEN complementation analysis constructs 

To tag Xist with Bgl stem-loops, we nucleofected cells with pBS-Ptight-Xist-BglSL 
(plasmid harbouring 18 repeats of Bgl stem-loops inserted 
between homology arms to target Xist exon 7, carrying a G418 selection 
gene, a gift from O. Masui). After G418 selection and FLP-FRT mediated 
removal of the selection cassette, clones were picked and genotyped. 
Positive clones were further tested to ensure that the stem-loop-tagged 
Xist could properly be induced and trigger gene silencing upon addition 
of doxycycline (data not shown). 

Spen cDNA truncations were generated by splicing out different 
regions of the Spen open reading frame (Genscript, ORF clone 
OMul1416C) using overlap extension PCR. Each Spen truncation was 
cloned downstream of a CAGGS promoter into a vector carrying homology 
arms for targeted insertion at the Rosa26 locus as well as an 
SV40-promoter-driven hygromycin-resistance gene. The BglG-GFP-SPOC 
targeting plasmid was designed by inserting a translational fusion 
between a BglG-GFP cassette and SPEN amino acids 3244–3643 into 
the same Rosa26 targeting vector. Each of these ‘complementation’ constructs 
was independently targeted at Rosa26 in SPEN-degron mouse 
ES cells. Independent clones were picked and protein expression of each 
SPEN truncation was assessed by western blot. XCI complementation 
analysis was then performed in 2–3 independent clones for Spen cDNA 
truncations, and 4 independent clones for BglG-GFP-SPOC-expressing 
clones. The ability of cells to accumulate BglG-mCherry, BglG-GFP 
and BglG-GFP-SPOC upon the addition of doxycycline was assessed 
using microscopy (data not shown). 


Immunofluorescence 

ES cells were dissociated using trypsin (Invitrogen), washed extensively 
in medium, and allowed to attach on poly-L-lysine (Sigma)-coated cov- 
erslips for 10 min. Cells were then fixed with 3% paraformaldehyde in 
PBS for 10 min at room temperature, washed in PBS three times, and 
permeabilized with 0.25% Triton X-100 in PBS for 5 min at room tem- 
perature. Coverslips were then washed three times in PBS and blocked 
for 1 h with blocking buffer (PBS containing 2.5% BSA, 0.1% Tween20 and 
10% normal goat serum). Coverslips were then incubated with primary 
antibodies diluted in blocking buffer at 4 °C overnight, washed three 
times for 5 min in PBST (0.1% Tween20) the next day, incubated with 
fluorescently labelled secondary antibodies (1/500 in blocking buffer) 
for 1 h at room temperature, and washed again three times for 5 min 
in PBST. DAPI (0.2 mg ml⁻¹) was added to the penultimate wash and 
coverslips were mounted with Vectashield (Vectorlabs). 


Immunoprecipitation 

Nuclear extracts were prepared by resuspending 50 million fresh cells in 
10 ml ice-cold buffer A (10 mM HEPES pH 7.9, 10 mM KCl, 1.5 mM MgCl₂, 
0.1% NP-40, cOmplete EDTA free, phosSTOP) and rotating for 10 min at 
4 °C. Nuclei were centrifuged at 800g for 10 min at 4 °C and resuspended 
in 1 ml IP buffer C150 (20 mM HEPES pH 7.9, 150 mM NaCl, 1.5 mM MgCl₂, 
0.2 mM EDTA, 0.25% NP-40, cOmplete EDTA free, phosSTOP). Lysates 
were briefly sonicated followed by Benzonase (Merck) digestion for 
30 min at 4 °C. Finally, lysates were cleared through centrifugation 
at 13,000 rpm for 20 min before being incubated with 15 μl of GFP 
trap magnetic agarose bead slurry (ChromoTek) overnight at 4 °C. 
Beads were washed 5 times in IP buffer. For co-immunoprecipita- 
tion (Co-IP) western blot, washed beads were directly resuspended 
in LDS buffer (Thermo) containing 200 mM DTT, and boiled at 95 °C 
for 10 min. 


Proteomics and mass spectrometry analysis 

Proteins on magnetic beads were washed twice with 100 μl of 25 mM 
NH₄HCO₃ and we performed on-bead digestion with 0.2 μg of trypsin/LysC 
(Promega) for 1 h in 100 μl of 25 mM NH₄HCO₃. Samples were then 
loaded onto homemade C18 StageTips for desalting. Peptides were 
eluted using 40/60 MeCN/H₂O + 0.1% formic acid and concentrated 
to dryness under vacuum. Online chromatography was performed 
with an RSLCnano system (Ultimate 3000, Thermo Scientific) coupled 
online to a Q Exactive HF-X with a Nanospray Flex ion source (Thermo 
Scientific). Peptides were first trapped on a C18 column (75 μm inner 
diameter × 2 cm; nanoViper Acclaim PepMap 100, Thermo Scientific) 
with buffer A (2/98 MeCN/H₂O in 0.1% formic acid) at a flow rate of 
2.5 μl min⁻¹ over 4 min. Separation was then performed on a 50 cm × 75 μm 
C18 column (nanoViper Acclaim PepMap RSLC, 2 μm, 100 Å, Thermo 
Scientific) regulated to a temperature of 50 °C with a linear gradient of 
2% to 30% buffer B (100% MeCN in 0.1% formic acid) at a flow rate of 
300 nl min⁻¹ over 91 min. Mass spectrometry full scans were performed in the 
ultra-high-field Orbitrap mass analyser over the range m/z 375–1,500 
with a resolution of 120,000 at m/z 200. The top 20 most intense ions 
were subjected to Orbitrap for further fragmentation via high-energy 
collision dissociation activation and a resolution of 15,000 with the 
intensity threshold kept at 1.3 x 10°. We selected ions with charge state 
from 2+ to 6+ for screening. Normalized collision energy was set at 27 
and the dynamic exclusion at 40 s. For identification, the data were 
searched against the M. musculus (UP000000589) UniProt database 
using Sequest HF through Proteome Discoverer (v.2.2). Enzyme speci- 
ficity was set to trypsin and a maximum of two missed-cleavage sites 
were allowed. Oxidized methionine and N-terminal acetylation were 
set as variable modifications. Maximum allowed mass deviation was 
set to 10 ppm for monoisotopic precursor ions and 0.02 Da for MS/ 
MS peaks. The resulting files were further processed using myProMS 
v3.6 (work in progress). Calculation of the false discovery rate used 
Percolator and was set to 1% at the peptide level for the whole study. 
The label-free quantification was performed by peptide extracted 
ion chromatograms (XICs) computed with MassChroQ version 2.2. 
For protein quantification, XICs from proteotypic peptides shared 
between compared conditions (TopN) with two missed cleavages were 
used. Median and scale normalization was applied on the total signal to 
correct the XICs for each biological replicate. To estimate the significance 
of the change in protein abundance, a linear model (adjusted on 
peptides and biological replicates) was fitted and P values were 
adjusted with a Benjamini-Hochberg false-discovery rate procedure 
with a control threshold set to 0.05. The mass spectrometry proteomics 


data have been deposited to the ProteomeXchange Consortium via 
the PRIDE partner repository with the dataset identifier PXD015699. 
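
The Benjamini-Hochberg adjustment mentioned above can be sketched as follows. This is a generic textbook version of the procedure, not the myProMS implementation; the function name `benjamini_hochberg` and the example P values are hypothetical and serve only to show how adjusted values are compared with the 0.05 threshold.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Each sorted P value p_(k) is scaled by m / k (m = number of tests,
    k = rank), then a running minimum from the largest rank downwards
    keeps the adjusted values monotonic.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for four proteins, for illustration only.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
significant = [a <= 0.05 for a in adj]
print(adj)
print(significant)
```

With these invented inputs only the first protein stays below the 0.05 threshold after adjustment, even though three raw P values are below 0.05.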


Cross-linked CUT&RUN 

CUT&RUN against SPEN was performed during a timecourse of Xist 
induction/SPEN degradation: 0 h DOX, 4 h DOX, 8 h DOX, 24 h DOX 
and 8 h DOX + auxin. Two biological replicates were performed. The 
original CUT&RUN protocol” was adapted for fixed cells: 10° cells in 
suspension were fixed with 2% formaldehyde diluted in PBS for 10 min 
at room temperature (2 ml final volume). Fixation was quenched with 
125 mM glycine for 5 min and cells were washed twice in 1 ml PBS. Fixed 
cells were then permeabilized with 1 ml permeabilization buffer (20 mM 
HEPES pH 7.9, 150 mM NaCl, 0.5 mM spermidine, 0.25% Triton X-100, 
cOmplete EDTA free) for 5 min and washed twice in 1 ml PBS. Cells were 
then resuspended in 1 ml washing buffer (20 mM HEPES pH 7.9, 150 mM 
NaCl, 0.5 mM spermidine, 0.1% BSA, cOmplete EDTA free), bound to 
activated concanavalin beads (50 µl bead slurry used per 10 million 
cells) for 10 min, and blocked in 1 ml blocking buffer (wash buffer + 
2 mM EDTA) for 5 min. At this stage, cells were resuspended in 500 µl 
wash buffer containing target antibodies diluted 1/200, transferred to 
0.5-ml tubes, and incubated overnight at 4 °C on an end-to-end rotator. 
Cells were washed three times in 500 µl washing buffer followed by 1-h 
incubation with pA-MNase (500 µl of washing buffer containing 700 
ng ml⁻¹ pA-MNase, produced by the Protein Expression and Purification 
Core Facility of Institut Curie) and washed again three times in 
500 µl washing buffer. After the last wash, cells were resuspended in 
150 µl washing buffer, transferred to 1.5-ml tubes, and equilibrated to 
0 °C in a metal block for 10 min. To start digestion, CaCl2 was added 
to a final concentration of 1.5 mM, taking care to return each sample 
to 0 °C immediately afterwards. Digestion was performed at 0 °C for 
1 h, before being stopped by adding 150 µl of 2X-STOP solution (200 
mM NaCl, 20 mM EDTA, 5 mM EGTA, 0.1% NP-40, 40 µg ml⁻¹ glycogen). 
RNase A was added to a final concentration of 50 µg ml⁻¹ and samples 
were incubated at 37 °C for 20 min. SDS and proteinase K were then 
added to final concentrations of 0.1% and 300 µg ml⁻¹, respectively, 
and samples were incubated at 56 °C for 2 h followed by 68 °C for 16 h 
to reverse cross-linking. Total DNA was extracted using phenol/chlo- 
roform followed by two rounds of ethanol precipitation and DNA size 
selection (using 0.55x volume of Ampure XP beads relative to the DNA 
sample volume) to remove the large predominating undigested DNA 
fragments. Each time, beads were discarded and the supernatant (con- 
taining the selected small fragments resulting from MNase digestion) 
was precipitated with ethanol. After elution in 50 µl TE buffer, samples 
were quantified and analysed using Qubit and Tapestation assays. 
CUT&RUN libraries were prepared from 50 ng DNA per sample, using 
the Accel-NGS 2S Plus DNA Library Kit (Swift) according to the manu- 
facturer’s protocol. Paired-end 100-nt sequencing was performed on 
a HiSeq2500 (Illumina). 


Bioinformatics analyses 

All data were mapped to the mouse genome mm10, using the BL6-EijJ/ 
CAST SNPs from the mouse genome project (v.5 SNP142), and the gene 
annotation from Ensembl (v.92). Analyses were performed in R (v.3.4.2) 
and Bioconductor (v.3.6). See ref. “ for more details. 


RNA-seq analysis 

Reads were trimmed using Trimgalore (v.0.4.4), mapped using STAR 
(2.5.3a, parameters: --outFilterMultimapNmax 1 --outFilterMismatchNmax 
999 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 500000 
--alignMatesGapMax 500000 --alignEndsType EndToEnd --outSAMattributes 
NH HI NM MD), and removed when mapping to the mitochondrial genome. 
Remaining reads were split by allele using SNPsplit 
(v.0.3.2). Allele-specific and the unassigned bam files were sorted, 
duplicates removed using picard (v.2.18.2, parameters: 
REMOVE_DUPLICATES=true ASSUME_SORTED=true) and pooled as the total reads. 
Quantification of expression was performed using featureCounts 
(parameters: -p -t exon -g gene_id, -s 1 for stranded RNA-seq of in vitro 
cells, -s 0 for non-stranded RNA-seq of single embryos). Data were then 
analysed in R using DESeq2 (v.1.18.1), calculating the sizeFactor on 
the count of total reads and applying it to the allele-specific counts. 
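
The normalization step can be read as follows: size factors are estimated from the total read counts and then reused to scale the allele-specific counts. The sketch below uses a plain-Python median-of-ratios estimator, which is the idea behind the default size-factor estimation in DESeq2; it is not the DESeq2 code itself, and the function names and toy counts are illustrative assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count table."""
    n_samples = len(counts[0])
    # Keep genes with non-zero counts in every sample, so logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts, factors):
    """Divide each column of a count table by its sample's size factor."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy data: 3 genes x 2 samples; sample 2 is sequenced twice as deep.
total_counts = [[100, 200], [50, 100], [10, 20]]
b6_counts = [[40, 80], [25, 50], [2, 4]]

# Factors come from total reads and are applied to the allelic counts.
factors = size_factors(total_counts)
b6_normalized = normalize(b6_counts, factors)
```

Estimating the factors on total reads, rather than on each allele separately, keeps both alleles of a sample on the same scale, so that allelic imbalance is not normalized away.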

For all RNA-seq analyses (SPEN-degron mouse ES cells, NPCs, and 
Spen-knockout embryos), genes showing fewer than 10 total allelic 
reads in at least one sample were discarded from the analysis. Allelic 
ratios were then computed for genes as follows: allelic_ratio = reads_B6/ 
(reads_B6 + reads_Cast). Allelic ratios were then averaged between biological 
replicates. 
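
A minimal sketch of this filtering and averaging is given below. It assumes that the allelic ratio is taken as reads from the B6 allele over the total allelic reads (B6 plus Cast), and that the lists passed in are replicates of a single condition; the function name, the dictionary layout and the gene names are hypothetical.

```python
def allelic_ratios(b6_reads, cast_reads, min_total=10):
    """Per-gene allelic ratio averaged over replicates.

    b6_reads and cast_reads map gene -> list of read counts, one per sample.
    Genes with fewer than `min_total` allelic reads in any sample are dropped.
    """
    ratios = {}
    for gene, b6 in b6_reads.items():
        cast = cast_reads[gene]
        totals = [x + y for x, y in zip(b6, cast)]
        if min(totals) < min_total:
            continue  # too few allelic reads in at least one sample
        per_sample = [x / t for x, t in zip(b6, totals)]
        ratios[gene] = sum(per_sample) / len(per_sample)
    return ratios


# Hypothetical counts for two replicates of one condition.
b6 = {"GeneA": [30, 50], "GeneB": [3, 40]}
cast = {"GeneA": [70, 50], "GeneB": [4, 60]}
ratios = allelic_ratios(b6, cast)
```

In this toy input, GeneB is discarded because one replicate has only 7 allelic reads, and GeneA receives the mean of its two per-replicate ratios.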


RNA-seq analysis of SPEN-degron mouse ES cells. In order to define 
differential dependencies on SPEN for gene silencing during XCI in 
mouse ES cells, we removed skewed genes (that is, genes showing allelic 
ratios outside of a [0.15;0.85] interval in control conditions) from the 
analysis in Fig. 1d. We then defined a silencing index, translating how 
much a gene is silenced after 24 h of Xist induction with respect to the 
control condition: silencing_index = 1 - (allelic_ratio_DOX/allelic_ratio_control). 
We next filtered out genes showing less than 10% silencing (that is, 
silencing_index < 0.1) in SPEN non-depleted conditions. 
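
The two filters and the silencing index can be sketched as below. The code assumes that per-gene allelic ratios are already available for the control and the 24 h DOX conditions; the function names, default thresholds as keyword arguments, and gene names are illustrative assumptions.

```python
def silencing_index(ratio_control, ratio_dox):
    """1 - (allelic ratio after induction / allelic ratio in control)."""
    return 1.0 - ratio_dox / ratio_control


def select_genes(control, dox, low=0.15, high=0.85, min_silencing=0.1):
    """Keep non-skewed genes that show at least `min_silencing` silencing."""
    kept = {}
    for gene, ratio_control in control.items():
        if not (low <= ratio_control <= high):
            continue  # skewed in control conditions
        index = silencing_index(ratio_control, dox[gene])
        if index < min_silencing:
            continue  # less than 10% silencing with SPEN present
        kept[gene] = index
    return kept


# Hypothetical allelic ratios in control and after 24 h of DOX.
control = {"GeneA": 0.50, "GeneB": 0.95, "GeneC": 0.40}
dox = {"GeneA": 0.10, "GeneB": 0.20, "GeneC": 0.38}
kept = select_genes(control, dox)
```

Here GeneB is removed as skewed and GeneC as insufficiently silenced, leaving GeneA with a silencing index of 0.8.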

k-means with three clusters was then performed on the raw allelic_ratios 
across control, 24 h DOX and 24 h DOX + auxin conditions. Clustering 
identified three groups of genes differing by their response to 
loss of SPEN during Xist induction. To define how dependent on SPEN a 
gene is for silencing, we expressed the silencing defect observed upon 
loss of SPEN as a fraction of the total silencing that normally occurs 
in the presence of SPEN. Computationally, this translates to: 
Spen_dependence_index = 1 - (silencing_index_DOX+auxin/silencing_index_DOX). 
A Spoc_dependence_index was derived identically. 
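
As a worked illustration of the dependence index, the sketch below expresses the silencing lost upon SPEN depletion as a fraction of the silencing seen with SPEN present. The function name and the example values are assumptions; division by zero is not expected in practice because genes with a silencing index below 0.1 in non-depleted conditions were filtered out beforehand.

```python
def dependence_index(silencing_dox, silencing_dox_auxin):
    """Fraction of normal silencing that is lost when SPEN is depleted."""
    return 1.0 - silencing_dox_auxin / silencing_dox


# Hypothetical silencing indices (DOX alone, DOX + auxin).
full = dependence_index(0.8, 0.0)     # silencing abolished without SPEN
partial = dependence_index(0.8, 0.4)  # half of the silencing is lost
none = dependence_index(0.8, 0.8)     # silencing unaffected by SPEN loss
```

An index close to 1 therefore marks a gene whose silencing depends almost entirely on SPEN, and an index close to 0 marks a gene that is silenced regardless of SPEN.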

For integration with Hdac3-knockout RNA-seq during XCI, we integrated 
the SPEN-degron dataset with an Hdac3-knockout RNA-seq 
dataset generated from the same mouse ES cell background (TX1072) 
and at the same time point of Xist induction. The dataset was processed 
identically, and an Hdac3-dependence index was also computed as 
follows: Hdac3_dependence_index = 1 - (silencing_index_Hdac3KO/silencing_index_WT). 


Spen-knockout E3.5 embryo RNA-seq analysis. Integration of Spen- 
knockout and Xist-knockout embryo datasets was performed by inte- 
grating our Spen-knockout E3.5 female embryo RNA-seq dataset with 
a Xist-knockout single-cell RNA-seq (processed as pseudo-bulk for our 
analysis) dataset from E3.5 female embryos, also generated from a 
M. musculus domesticus x M. musculus castaneus mouse background. 


SPEN-degron NPC RNA-seq analysis. In NPCs, X-linked genes were 
defined as escapees if their transcript allelic ratio was greater than 0.15 
in at least one condition (0 h, 24 h or 48 h of SPEN depletion). 


CUT&RUN bioinformatics analysis 

Reads were trimmed using Trimgalore (v.0.4.4), mapped using STAR 
(2.5.3a, parameters: --outFilterMultimapNmax 1 --outFilterMismatchNmax 
999 --outFilterMismatchNoverLmax 0.06 --alignIntronMax 1 
--alignMatesGapMax 2000 --alignEndsType EndToEnd --outSAMattributes 
NH HI NM MD), and removed when mapping to the mitochondrial genome. 
Remaining reads were split by allele using SNPsplit 
(v.0.3.2). Allele-specific and the unassigned bam files were sorted, 
duplicates removed using picard (v.2.18.2, parameters: 
REMOVE_DUPLICATES=true ASSUME_SORTED=true) and pooled as the total reads. 
BigWig coverage files were generated using deepTools bamCoverage 
(parameters: --extendReads --binSize 1, with --extendReads 200 for 
single-end data). A scaling factor was calculated as 10°/total number 
of reads, and the same factor was given as the --scaleFactor parameter 
for both allelic signals. Peak calling was performed using macs2 
(v.2-2.1.2.1, parameters: --bw 300 -f BAMPE -q 0.01 --keep-dup auto --broad 
for CUT&RUN and pol2S5 ChIP-seq; --bw 300 -f BAMPE -q 0.01 --keep-dup 
auto --call-summits for other ChIP-seq). For quantification 
of signal in peaks, reads were counted using the featureCounts 
function from Subread (v.1.28.1, parameters: -p -s 0). Data scaling was 
performed in R using DESeq2 (v.1.18.1), calculating the sizeFactor on 
the counts of total reads in 10-kb windows and applying it to the 
allele-specific counts in peaks. 


Peak filtering. SPEN-specific peaks were defined as having log2FoldChange 
> 1 compared to auxin treatment (negative control, SPEN-degraded), 
and an adjusted P value < 0.001. 
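
The peak filter amounts to two thresholds applied jointly, which can be sketched as follows. The record layout (`name`, `log2_fold_change`, `adjusted_p`) and the example peaks are hypothetical; in practice the values would come from the differential analysis against the auxin-treated control.

```python
def spen_specific_peaks(peaks, min_log2_fold_change=1.0, max_adjusted_p=0.001):
    """Return names of peaks enriched over the auxin-treated control."""
    return [
        peak["name"]
        for peak in peaks
        if peak["log2_fold_change"] > min_log2_fold_change
        and peak["adjusted_p"] < max_adjusted_p
    ]


# Hypothetical differential-enrichment results for three peaks.
peaks = [
    {"name": "peak_1", "log2_fold_change": 2.3, "adjusted_p": 1e-6},
    {"name": "peak_2", "log2_fold_change": 0.8, "adjusted_p": 1e-8},
    {"name": "peak_3", "log2_fold_change": 1.7, "adjusted_p": 0.02},
]
selected = spen_specific_peaks(peaks)
```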


Total SPEN enrichment in promoter window. To compare SPEN accumulation 
among promoters of all X-linked genes in an unbiased manner 
(including genes that fail to have any peak called at their promoters), 
we performed DESeq2 analysis on counts spanning total promoter 
windows. 


Genomic features and integration with RNA-seq. Promoters were 
defined as ±2-kb windows centred around the transcription start sites 
of genes. Putative active enhancers and their deacetylation kinetics 
during XCI were obtained from ref. “*. Gene-silencing efficiency was 
determined according to the silencing index defined in the section 
‘RNA-seq analysis’. We observed that our silencing_index ranges between 
0 and 0.9. Hence, we split this interval in three to define high, 
medium and low gene-silencing efficiency groups with silencing index 
comprising [0.6,0.9], [0.3,0.6[ and [0,0.3[, respectively. 
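
The grouping by silencing efficiency can be written as a simple binning function. This is a sketch with an assumed function name; values outside the observed range of 0 to 0.9 are not treated specially here.

```python
def silencing_group(index):
    """Assign a silencing index to the high, medium or low group."""
    if index >= 0.6:
        return "high"    # [0.6, 0.9]
    if index >= 0.3:
        return "medium"  # [0.3, 0.6[
    return "low"         # [0, 0.3[


# Boundary values fall into the upper group of each pair of intervals.
groups = [silencing_group(x) for x in (0.05, 0.3, 0.59, 0.6, 0.9)]
```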


Integration with publicly available ChIP-seq data. SPEN peaks 
were intersected with other peaks called from publicly available 
ChIP-seq data for HDAC3” (same cellular background, TX1072, 
2i + LIF condition), RNAPII-pS5”°, CHD4” and MBD3” (all in the 2i + LIF 
condition). 


Hi-C analysis 

Data were processed with HiC-Pro (v.2.11.0) in allele-specific mode. Only 
pairs with both reads having MAPQ > 30 were kept. Matrices were made 
using cooler cload (v.0.8.5) at 1-kb or 10-kb resolution, using HiGlass 
for visualization and snapshots. Topologically associating domains 
were called using HiCExplorer hicFindTADs (parameters: 
--correctForMultipleTesting fdr --minDepth 250000 --maxDepth 4000000 
--step 50000 --thresholdComparisons 0.1 --delta 0). The expected value for 
the Hi-C signal was calculated on the non-allele-specific signal using 
cooltools compute-expected. Average scaled matrices of observed/ 
expected values for allele-specific signal were produced with coolpup.py 
(parameters: --local --rescale --rescale_size 299), using non-allele-specific 
expected values to normalize both alleles to the same expected 
values. Average heat maps were plotted using plotpup.py. For quantification, 
the Hi-C signal was averaged over the upper triangle of topologically 
associating domains for each allele-specific matrix (10 kb) using 
the HiCExplorer hicSummarizeScorePerRegion available at 
https://github.com/heard-lab/HiCExplorer (parameters: --summarizeType 
mean --rmDiag 1). 
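
To make the quantification step concrete, the sketch below averages a contact matrix over the upper triangle of one domain and compares the two alleles. It is not the hicSummarizeScorePerRegion code: the function name is hypothetical, and reading the rmDiag setting as "skip the main diagonal" is an assumption. The toy matrices stand in for 10-kb allele-specific matrices.

```python
def tad_upper_triangle_mean(matrix, start, end, skip_diagonals=1):
    """Mean contact value over the upper triangle of bins [start, end).

    The first `skip_diagonals` diagonals (0 = main diagonal) are excluded,
    so the domain must span more bins than `skip_diagonals`.
    """
    values = [
        matrix[i][j]
        for i in range(start, end)
        for j in range(i + skip_diagonals, end)
    ]
    return sum(values) / len(values)


# Hypothetical symmetric contact matrices for the inactive (Xi) and active (Xa) X.
xi = [[9, 2, 1, 0], [2, 9, 2, 1], [1, 2, 9, 2], [0, 1, 2, 9]]
xa = [[9, 4, 2, 0], [4, 9, 4, 2], [2, 4, 9, 4], [0, 2, 4, 9]]

# One domain covering bins 0 to 2; the ratio compares Xi with Xa.
xi_mean = tad_upper_triangle_mean(xi, 0, 3)
xa_mean = tad_upper_triangle_mean(xa, 0, 3)
allelic_ratio = xi_mean / xa_mean
```

A ratio below 1 indicates weaker intra-domain contacts on the inactive X than on the active X for that domain.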


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 

RNA-seq, Hi-C and CUT&RUN data used in this study have been 
deposited in the Gene Expression Omnibus under accession number 
GSE131784. Source Data for Figs. 1-3 and Extended Data Figs. 1, 3, 4 are 
provided with the paper, either in the form of supplementary tables 
or source data files. 


Extended Data Fig. 1 | SPEN mediates gene silencing across the entire X 
chromosome in vitro and in vivo. a, Schematic representation of the 
SPEN-degron genotype with AID-HaloTag insertions in frame with the C 
terminus of endogenous SPEN. Targeted homozygous insertion of V5-tagged 
OsTIR1 at the TIGRE locus (top left) results in its constitutive protein expression 
as assessed by western blot (bottom left). Right, Sanger sequencing results for 
a PCR amplicon specific to AID-HaloTag insertions and covering a SNP outside 
of the recombined left homology arm. Detection of both alleles in the amplicon 
confirms homozygous AID knock-in. b, Fixed-cell imaging of HaloTag in wild-type 
cells (left), in SPEN-degron mouse ES cells (middle) and in SPEN-degron 
mouse ES cells exposed to auxin for 4 h (right). Cells were labelled with 
Halo-JF646 before fixation. SPEN-Halo is properly localized to the nucleus, and is 
depleted upon auxin treatment. This experiment was repeated at least twice 
with similar results. c, Bar graph showing the proportion of cells displaying Xist 
RNA clouds (quantified using RNA FISH) before and after degradation of SPEN 
(n, number of cells counted; χ² test). d, Violin plot showing the distribution of 
X-chromosomal transcript allelic ratios (obtained by RNA-seq) after 0 h DOX, 
24 h DOX or 24 h DOX + auxin treatment in wild-type SPEN-degron mouse ES 
cells. Horizontal lines denote the median, box limits correspond to upper and 
lower quartiles, averages of two independent clones shown, n = 434 genes, 
two-sided Student’s t-test. e, RNA FISH experiments for Xist (red) and two 
X-linked genes: Atrx (grey) and Huwe1 (green), in SPEN-degron mouse ES cells 
treated with DOX only, or DOX in combination with auxin for 24 h. The 
proportion of Atrx/Huwe1 monoallelic and biallelic expression among 
Xist-expressing cells is shown (n, number of cells counted; χ² test). f, Illustration of 
the control hybrid mouse crossbreeding scheme for the experiment shown in 
Fig. 1g,h. g, Quantitative PCR (qPCR) analysis of Spen and Xist transcripts in 
wild-type (n = 7) and maternal-zygotic Spen-knockout (n = 5) E3.5 embryos. 
h, Pyrosequencing assay of three X-linked transcripts in maternal-zygotic 
Spen-knockout (n = 5) and wild-type (n = 7) E3.5 embryos (two-sided Student’s t-test). 
In g, h, bars show the mean value and individual data points are shown as dots. 

Extended Data Fig. 2 | SPEN localizes to the X chromosome immediately 
upon Xist upregulation and throughout the stages of XCI, but is dispensable 
for maintenance of X-linked gene silencing. a, Scheme of the strategy for 
live-cell imaging of SPEN protein and Xist RNA. b, Live-cell snapshot after 16 h of Xist 
induction in the cell line shown in a. This experiment was repeated at least 
twice with similar results. c, d, Kinetics of total intensity (c) and area (d) of Xist 
(red) and SPEN (green) domains over time during Xist induction. The data in 
c, d are the averages of 27 tracked cells. Error bars indicate standard deviation. 
Images were acquired every 10 min. Time point 1 is defined as the earliest time 
at which a SPEN or Xist domain is detected in each cell. Intensity and area values 
were respectively normalized to the maximum value reached for each signal 
(SPEN and Xist). e, Hi-C map of the inactive (top) and active (bottom) X 
chromosomes (resolution, 1.024 Mb) in NPCs after 0 h or 48 h of 
auxin-mediated SPEN depletion. f, Heat map of the average contact enrichment on 
scaled topologically associating domains containing escapees in NPCs after 
0 h or 48 h of auxin-mediated SPEN depletion. g, Quantification of the allelic 
ratio (inactive/active X chromosome) of the Hi-C signal within topologically 
associating domains (n = 37) shown in f, after 0 h or 48 h of auxin-mediated 
SPEN depletion. Horizontal lines denote the median, box limits correspond to 
upper and lower quartiles, two-sided Wilcoxon rank-sum test. In e, f, averages 
of two independent clones are shown. 


Extended Data Fig. 3 | The SPOC domain of SPEN mediates gene silencing 
and interacts with multiple molecular pathways. a, Scheme of 
complementation strategy. b, Western blot detection of overexpressed 
3xFlag-tagged SPEN protein rescue fragments. c, Scheme showing endogenous 
deletion of SPOC. d, Sub-nuclear localization of endogenous SPEN lacking its 
SPOC domain upon Xist RNA induction. The inactive X chromosome is 
identified using immunofluorescence detection of H2AK119ub1. e, Bar graph 
showing the proportion of cells with Xist RNA clouds (assayed by RNA FISH) in 
wild-type cells and three independent SPOC-deletion clones after induction of 
Xist for 24 h (n, number of counted cells). f, RNA FISH for Xist (red) and Huwe1 
(green) in SPOC-deletion and wild-type cells treated with DOX for 24 h. g, Violin 
plot showing the distribution of X-chromosomal transcript allelic ratios 
(measured by RNA-seq) after 0 h or 24 h DOX treatment in wild-type and 
SPOC-deletion mouse ES cells. Horizontal lines denote the median, box limits 
correspond to upper and lower quartiles, averages of three independent clones 
shown, n = 469 genes, two-sided Student’s t-test. h, Bar graph of transcript 
allelic ratios (obtained from pyrosequencing) for four X-linked genes in 
SPOC-deletion (blue) or wild-type (grey) cells. Bars show mean values for three 


independent SPOC-deletion clones (*P<10™*, two-sided Student's t-test). 

i, Bar graph showing the proportion of cells expressing Huwe1 monoallelically 
(white) or biallelically (grey), assayed by RNA FISH, in wild-type cells and in 
three independent SPOC-deletion clones after induction of Xist for 24 h 
(n, number of counted cells). j, Density plot showing the distribution of gene 
silencing defects (see Methods) observed across the X chromosome in 
RNA-seq data from HDAC3-knockout, SPEN-degron and SPOC-deletion (this study) 
ES cells after 24 h of Xist induction. k, Bar graph of normalized allelic ratios 
(obtained from pyrosequencing) for four X-linked genes in HDAC3-knockout 
(brown), SPOC-deletion (blue) and wild-type (grey) cells after 24 h of Xist 
induction. Bars show mean values for two independent HDAC3 clones and 
three independent SPOC deletion clones; individual data points are shown. 
l, Volcano plot of fold changes in GFP-pull-down (BglG-GFP-SPOC compared 
with BglG-GFP) and their adjusted P values (Benjamini-Hochberg procedure, 
see Methods for statistical analysis). Quantitative label-free mass 
spectrometric analysis was performed on four independent biological 
replicates. In b, d, f, experiments were repeated at least twice with similar 
results. 

Extended Data Fig. 4 | SPEN is recruited by Xist to active gene promoters and 
enhancers where it silences transcription and subsequently disengages 
from chromatin. a, Bar graph showing the number of SPEN peaks on each 
chromosome after 0 h, 4 h, 8 h and 24 h of Xist induction in mouse ES cells. 
b, Annotation of SPEN peaks on autosomes. c, Heat map showing allelic ratios 
at SPEN peaks during XCI among different X-linked genomic features. d, Violin 
plot showing expression (RPKM) of genes accumulating SPEN (n = 259) or not 
accumulating SPEN (n = 689) at their promoters. Genes showing 0 RPKM were 
excluded from this plot. e, Box plots showing SPEN enrichment after 4 h of Xist 
induction within promoter windows of genes grouped on the basis of their level 
of dependency on SPEN for gene silencing (see Fig. 1e). f, Box plots showing 
SPEN enrichment after 4 h of Xist induction within promoter windows of genes 
grouped on the basis of whether or not they are silenced at 24 h of Xist 
induction (see Methods). In d-f, data were analysed using the two-sided 
Wilcoxon rank-sum test, horizontal lines denote the median, box limits 
correspond to upper and lower quartiles. g, UCSC Genome Browser 
allele-specific track showing SPEN binding around Kdm6a, an escaping gene (blue, 
Cast-Xa; red, B6-Xi; all tracks are scaled identically). h, Bar graphs showing 
overlap between SPEN-binding sites and the binding sites of four different 
factors at X-linked enhancers and promoters. i, j, Heat maps showing 
normalized SPEN enrichment (log2) at promoters (both replicates are shown) 
(i) and gene silencing kinetics (allelic ratio) during XCI (j) within three groups of 
X-linked genes showing different dynamics of SPEN accumulation and loss. 
k, Schematic of the function of SPEN in XCI. In a-f, h-j, data are from two 
biological replicates. 

Extended Data Fig. 5 | UCSC Genome Browser allelic tracks of SPEN binding 
and transcript expression at X-linked genes. a-n, Top, Genome Browser 
allelic tracks of SPEN binding (from CUT&RUN) at silenced genes (a-g) and 
non-silenced genes (h-n) during a time course of Xist induction in mouse ES cells 
(blue, Cast-Xa; red, B6-Xi; scaled identically within each panel). Bottom, allelic 
tracks of transcript expression (from RNA-seq) at 0 h and 24 h of Xist induction 
in mouse ES cells (light grey, Cast-Xa; black, B6-Xi; scaled identically within 
each panel). The relative position of each gene along the X chromosome is 
shown at the top of the figure. 

Reporting Summary 


Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency 
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist. 




Software and code 


Policy information about availability of computer code 


Data collection All microscopy images were acquired either with an Inverted Confocal Spinning Disk Roper/Nikon for fixed cells or a Super-resolution 
microscope OMX (Applied Precision Incorporation, DeltaVision) for live cells. Sequencing data was collected using the Illumina platform. 
Pyrosequencing data was collected using the PyroMark Q24 System from Qiagen. QPCR was performed on a ViiA 7 Real-Time PCR System 
from Thermo Fisher Scientific. Mass Spectrometry was performed on Thermo Scientific Orbitrap Fusion Tribrid MS from Thermo Fisher 
Scientific. Western blot (chemiluminescent) images were collected using a ChemiDoc MP from BioRad. 


Data analysis Trimgalore (v 0.4.4), STAR (2.5.3a), SNPsplit (v 0.3.2), picard (v2.18.2), DESeq2 (v1.18.1), R (3.4.1), featureCount (1.6.3), Fiji (2.0.0), dplyr 
(0.7.4), readr (1.1.1), tidyr (0.8.0), ggplot2 (2.2.1), macs2 (v 2-2.1.2.1), Subread (v1.28.1), HiC-Pro (v2.11.0) 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 


All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 
- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


All data generated in this study are available in the GEO database under accession number GSE131784. 
Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


x Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences 


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size Sample size was not predetermined. For mouse experiments, we used a sample size commonly used and accepted for basic statistical 
inference while using a justifiable number of mice. For tissue culture based experiments, at least two independent clones were 
systematically assessed for each genotype. Typically, for the SPOC tethering experiment, 4 independent clones were characterized with very 
little variation being observed between them. 


Data exclusions No data were excluded from the analysis, with the exception of live imaging analysis, from which dying cells, as well as cells which moved 
outside the field of view during movie acquisition, were removed. 


Replication All attempts at replication were successful and noted in the relevant figure legend. 


Randomization Samples were not randomized, given that samples were grouped according to their respective genotypes. 


Blinding For most experiments, no blinding was performed as most measurements were derived from third-party machines/software, and hence not 
affected by subjective interpretation. For RNA FISH image analysis (counting of Xist RNA clouds and X-linked gene pinpoints), each 
experimental condition was blinded from the first author F. Dossin during counting. 


Reporting for specific materials, systems and methods 



Antibodies 


Antibodies used Epitope, Antibody, reference, Application, Dilution 
HaloTag, Promega cat. #G9211, WB, 1/1000 
V5, Sigma cat. #V8012, WB, 1/2500 
Flag, Sigma cat. #F1804, WB and IF, 1/1000 and 1/200 
GFP, Abcam cat. #ab290 Lot GR3222604-1, CUT&RUN and IF, 1/200 
GFP, Roche cat. #11814460001, WB, 1/1000 
H2AK119ub1, Cell Signaling cat. #8240, IF, 1/500 
Lamin B1, Abcam cat. #ab16048, WB, 1/3000 
PCNA, DAKO cat. #M0879, WB, 1/3000 
Ncor1, Abcam cat. #ab2482 Lot GR320472-6, WB, 1/1000 
Ncor2, Abcam cat. #ab5802 Lot GR259451-18, WB, 1/1000 
Mta1, Cell Signaling cat. #5647, WB, 1/1000 
Wtap, Proteintech cat. #10200-1-AP, WB, 1/1000 
Mettl3, Abcam cat. #ab195352, WB, 1/1000 
Hdac3, SantaCruz cat. #sc-376957, WB, 1/1000 
Rpb1, Abcam cat. #ab817, WB, 1/2000 


Validation The HaloTag antibody is validated for western blot in Fig. 1b, with specific signal disappearing upon auxin treatment, and 
reappearing upon removal of auxin from the culture medium. 
The V5 antibody is validated for western blot in Extended Data Fig. 1a, with specific signal being observed only in cells expressing 
V5-tagged Tir1. 
The Flag antibody is validated for immunofluorescence in Fig. 3c, with specific nuclear signal being observed only in cells 
expressing Flag-tagged SPEN truncations. 
The GFP antibody is validated for CUT&RUN by showing specific genomic signal for GFP-tagged SPEN, which disappears upon 
treatment of cells with auxin. 
The H2AK119ub1 antibody has been validated for immunofluorescence in Zylicz et al., 2019. 
The Lamin B1 antibody has been KO validated by Abcam. 
The PCNA antibody is cited 361 times according to CiteAb. 
The Ncor1/Ncor2 antibodies are validated for western blot in Fig. 3h, as showing increased signal upon immunoprecipitation of 
SPOC. 
The Mta1 antibody has been cited 12 times previously according to CiteAb. 
The Wtap antibody has been KO validated by ProteinTech. 
The Mettl3 antibody has been KO validated by Abcam. 
The Hdac3 antibody has been cited 9 times according to CiteAb. 
The Rpb1 antibody has been cited 280 times previously according to CiteAb. 


Policy information about cell lines 


Cell line source(s) All cell lines were derived from the TX1072 female mouse embryonic stem cell line (Schulz et al., 2014) 
Authentication None of the cell lines were authenticated 
Mycoplasma contamination All cell lines tested negative for mycoplasma contamination 


Commonly misidentified lines cells used are not in the ICLAC database 
(See ICLAC register) 


Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals We used adult animals (from 6 weeks to 3 months for females, and 7 weeks to 1 year for males) for producing preimplantation 
embryos (first 4 days post mating). For Spen matings, a published conditional allele was used (Yabe et al., 2007). For oocyte 
deletions, the published Rosa26:Zp3-Cre allele was used (DeVries et al., 2000). F1 hybrid Spen+/- males were obtained by crossing 
Spen+/+ CAST/EiJ females with Spen+/- C57BL/6J males. For Spen maternally deleted embryos, Spenflox/flox Zp3-Cre+ve 
C57BL/6J females were crossed with Spen+/- F1 hybrid males. For Spen control embryos, Spenflox/flox Zp3-Cre-ve C57BL/6J 
females were crossed with Spen+/- F1 hybrid males. 


Wild animals this study does not contain any wild animals 
Field-collected samples this study does not contain animals collected from the fields. 
Ethics oversight Animal care and use for this study were performed in accordance with the recommendations of the European community 
(2010/63/UE). All experimental protocols were approved by the ethics committee of Institut Curie CEEA-IC118 under the number 
APAFIS#8812-2017020611033784v2 given by national authority in compliance with the international guidelines. 


Note that full information on the approval of the study protocol must also be provided in the manuscript. 


Article 


NEDD8 nucleates a multivalent cullin-RING-UBE2D 
ubiquitin ligation assembly 


https://doi.org/10.1038/s41586-020-2000-y 


Received: 16 August 2019 


Accepted: 9 January 2020 


Kheewoong Baek, David T. Krist, J. Rajan Prabu, Spencer Hill, Maren Klügel, 
Lisa-Marie Neumaier, Susanne von Gronau, Gary Kleiger & Brenda A. Schulman 


Published online: 12 February 2020 




Eukaryotic cell biology depends on cullin-RING E3 ligase (CRL)-catalysed protein 
ubiquitylation, which is tightly controlled by the modification of cullin with the 
ubiquitin-like protein NEDD8. However, how CRLs catalyse ubiquitylation, and the 
basis of NEDD8 activation, remain unknown. Here we report the cryo-electron 
microscopy structure of a chemically trapped complex that represents the 
ubiquitylation intermediate, in which the neddylated CRL1^β-TRCP promotes the transfer 
of ubiquitin from the E2 ubiquitin-conjugating enzyme UBE2D to its recruited 
substrate, phosphorylated IκBα. NEDD8 acts as a nexus that binds disparate cullin 
elements and the RING-activated ubiquitin-linked UBE2D. Local structural 
remodelling of NEDD8 and large-scale movements of CRL domains converge to 
juxtapose the substrate and the ubiquitylation active site. These findings explain how 
a distinctive ubiquitin-like protein alters the functions of its targets, and show how 
numerous NEDD8-dependent interprotein interactions and conformational changes 
synergistically configure a catalytic CRL architecture that is both robust, to enable 
rapid ubiquitylation of the substrate, and fragile, to enable the subsequent functions 
of cullin-RING proteins. 


CRLs orchestrate numerous eukaryotic processes—including transcrip- 
tion, signalling, cell division and differentiation—and CRL dysregulation 
underlies many pathologies. The activities of these enzymes depend on 
coordinated but dynamic interactions between dedicated cullin-RING 
complexes and several regulatory partner proteins. Cullins (CULs)
1–5 bind cognate RING-containing partners (RBX1 or RBX2) through
a conserved intermolecular cullin/RBX (hereafter C/R) domain, in
which a CUL β-sheet stably embeds an RBX strand. On one side of the
C/R domain, the CUL N-terminal domain can associate interchangeably
with numerous substrate-recruiting receptors. As examples,
human CUL1–RBX1 binds around 70 SKP1–F-box protein complexes
and CUL4–RBX1 binds around 30 DDB1–DCAF complexes, forming
the E3 enzymes CRL1^F-box protein or CRL4^DCAF, respectively (in which F-box
protein and DCAF represent substrate receptors for a given CRL).
The C-terminal WHB domain of the CUL protein and the RING domain 
of the RBX protein emanate from the other side of the C/R domain. 
To achieve E3 ligase activity, the RING domain recruits one of several 
ubiquitin-carrying enzymes, which presumably use distinct mecha- 
nisms to transfer ubiquitin to receptor-bound substrates. 

CRLs are regulated by reversible NEDD8 modification of a spe- 
cific lysine residue within the WHB domain of CULs. Although it has 
approximately 60% sequence identity to ubiquitin, NEDD8 uniquely 
activates CRL-dependent ubiquitylation. NEDD8 has been suggested
to have multiple roles in catalysis (including assisting in the recruitment
of ubiquitin-carrying enzymes, facilitating juxtaposition of the
substrate and ubiquitylation active site, and promoting conformational
changes), although the structural mechanisms of these effects
remain unknown. NEDD8 also stabilizes cellular CRLs by blocking
the exchange factor CAND1 from ejecting substrate receptors from
unneddylated CUL–RBX complexes. Neddylation controls around 20%
of ubiquitin-mediated proteolysis and presumably many nondegradative
functions of ubiquitin, and an inhibitor of NEDD8 (MLN4924, also
known as Pevonedistat) blocks HIV infectivity and is in clinical trials as
an anticancer agent.

Here we determine structural mechanisms that underlie ubiquitylation
by human neddylated CRL1^β-TRCP and E2 enzymes from the
UBE2D family, in which the F-box protein β-TRCP recruits a specific
phosphodegron motif in substrates including β-catenin and IκBα.
UBE2D knockdown stabilizes the CRL1^β-TRCP substrate IκBα, whereas
mutations that impair the ubiquitylation of β-catenin by CRL1^β-TRCP
promote tumorigenesis. Furthermore, hijacking of CRL1^β-TRCP enables
HIV to evade host immunity, and deamidation of Gln40 of NEDD8 by
an enteropathogenic and enterohaemorrhagic Escherichia coli effector
results in the accumulation of the CRL1^β-TRCP substrate IκBα as well as
substrates of other CRLs.


NEDD8 activation of ubiquitylation 

We used rapid quench-flow methods to obtain kinetic parameters for
CRL1^β-TRCP- and UBE2D-catalysed ubiquitylation of model substrates
(phosphopeptides from β-catenin and IκBα, containing single acceptor
lysines). We found that NEDD8 substantially stimulates the reaction,
by nearly 2,000-fold (Fig. 1a, Extended Data Fig. 1, Extended
Data Table 1). Performing experiments under conditions that allow
Fig. 1 | Role of NEDD8 and strategy for the visualization of dynamic ubiquitin
transfer from UBE2D to substrate by neddylated CRL1^β-TRCP. a, Effect of the
neddylation of CUL1 on CRL1^β-TRCP-catalysed ubiquitin transfer from UBE2D to a
radiolabelled IκBα-derived peptide substrate. The plots show the proportion
of substrate remaining during pre-steady-state rapid quench-flow
ubiquitylation reactions with saturating UBE2D3 and either unneddylated or
neddylated CRL1^β-TRCP. The symbols show the data from independent
experiments (n = 2 technical replicates). b, Schematic representing substrate
priming by neddylated CRL1^β-TRCP and UBE2D~Ub. The inset shows the
transition state during ubiquitylation. c, Chemical mimic of the ubiquitylation
intermediate, in which surrogates for the active site of UBE2D, the C terminus
of ubiquitin and the ubiquitin acceptor site on the IκBα-derived substrate
peptide are simultaneously linked.


for multiple UBE2D turnover events enabled us to quantify the effects 
of neddylation on substrate ‘priming’, in which ubiquitin is ligated 
directly to the substrate, compared with ‘chain elongation’, in which 
it is linked to a substrate-linked ubiquitin. The individual rates for the 
linkage of successive ubiquitins during polyubiquitylation showed 
that NEDD8 activates both substrate-priming and chain-elongation 
reactions. However, ubiquitin ligation to a substrate is tenfold faster 
than ubiquitin ligation to a substrate-linked ubiquitin, suggesting that 
neddylated CRL1^β-TRCP, together with UBE2D, optimally catalyses substrate-priming
reactions (Extended Data Table 1).


Cryo-EM reveals cullin-RING dynamics 

Ubiquitin transfer from a RING-docked UBE2D~Ub intermediate (in
which ~ indicates a thioester bond or thioester-bond mimic, Ub indicates
ubiquitin) to a substrate that is bound to an F-box protein was
difficult to rationalize from previous structural models and from
our cryo-electron microscopy (cryo-EM) reconstructions of unneddylated
and neddylated substrate-bound CRL1^β-TRCP (Extended Data
Fig. 2, Extended Data Table 2). We observed a well-resolved ‘substrate-scaffolding
module’ that resembles models based on crystal structures
of substrate-bound SKP1–β-TRCP and the portion of an F-box–SKP1–CUL1–RBX1
complex that includes the N-terminal domain of CUL1 and
the intermolecular C/R domain. However, for the RING domain of
RBX1 and the WHB domain of CUL1 (with or without covalently linked
NEDD8), density is either lacking or visualized only at low contour, in
varying positions in different classes. These domains apparently sample
multiple orientations, and it is therefore difficult to conceptualize rapid
ubiquitylation of a flexible substrate by uncoordinated nanometre-scale
motions of RBX1-activated UBE2D~Ub (Extended Data Fig. 2).
Capturing CRL-substrate ubiquitylation


Substrate priming involves fleeting simultaneous linkage of the active
site of UBE2D, the C terminus of ubiquitin and the substrate (Fig. 1b).
We chemically linked surrogates for these entities to form a stable
mimic of the transition state, which avidly binds neddylated CRL1^β-TRCP
(Fig. 1c, Extended Data Fig. 3a–d). After screening several complexes by
cryo-EM, we obtained a reconstruction at 3.7 Å resolution that showed
our proxy for a UBE2D~Ub–IκBα substrate intermediate bound to a
hyperactive version of neddylated CRL1^β-TRCP (Fig. 2a, Extended Data
Figs. 3, 4, Extended Data Table 2).

This complex, representing the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate
intermediate, explains rapid ubiquitylation through unprecedented
neddylated cullin–RING arrangements (Extended Data Fig. 5).
NEDD8 is nearly encircled by interactions; it binds the WHB domain
of CUL1 (to which it is linked) in an ‘activation module’, and positions
a ‘catalytic module’ relative to the substrate-scaffolding module to
juxtapose the active site and the substrate (Fig. 2b–d).

In the catalytic module, RBX1 binds UBE2D~Ub in the canonical RING-activated
‘closed’ conformation, in which noncovalent interactions
between UBE2D and ubiquitin allosterically activate the thioester
bond between them. Compared with previously isolated RING–UBE2D~Ub
structures, the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate
intermediate shows additional density corresponding to the
substrate proxy along a trajectory to the ubiquitylation active site
(Fig. 2c, Extended Data Figs. 6a, 7b). A UBE2D groove seems to engage
the substrate polypeptide in a manner poised to assist in projecting
adjacent lysines into the active site. This engagement of substrate
polypeptide may contribute to the ability of UBE2D to ubiquitylate a
broad range of proteins.

The structure revealed that the distance between the β-TRCP-bound
phosphodegron of IκBα and the UBE2D~Ub active site is around 22 Å,
which is compatible with the spacing between this motif and potential
acceptor lysines in many substrates (Extended Data Fig. 6). We could
therefore make the following predictions: firstly, peptide substrates
with sufficient residues (for example, 13 and 9) between the phosphodegron
and the acceptor lysine to span a distance of 22 Å should be rapidly
primed in a NEDD8-dependent manner; secondly, a peptide substrate
with too few spacer residues (for example, 4) to span this gap should be
severely impaired for priming by neddylated CRL1^β-TRCP and UBE2D, but
the addition of a ubiquitin could satisfy geometric constraints and
enable further polyubiquitylation; and thirdly, the substrates should
show little difference in UBE2D-mediated priming with unneddylated
CRL1^β-TRCP. Comparing kinetic parameters for peptide substrates with
13-, 9- and 4-residue spacers confirmed these predictions, with the
priming rate of the 4-residue-spacer peptide with neddylated CRL1^β-TRCP reduced
below the limit of our quantification (Extended Data Fig. 6, Extended
Data Table 1).
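
The geometric reasoning behind these predictions can be illustrated with a rough back-of-the-envelope sketch. It assumes that a fully extended polypeptide advances about 3.5 Å per residue; that figure is a textbook approximation and is not taken from this study, and the function names are hypothetical. Real peptides are flexible, so the values are upper bounds on reach rather than measured distances.

```python
# Rough geometric check of the spacer-length predictions.
# Assumption (not from the study): a fully extended polypeptide advances
# about 3.5 Å per residue; flexible peptides will usually reach less far.
RISE_PER_RESIDUE_A = 3.5   # assumed upper-bound rise per residue, in Å
GAP_A = 22.0               # degron-to-active-site distance from the structure, in Å


def max_reach(n_spacer_residues: int, rise: float = RISE_PER_RESIDUE_A) -> float:
    """Upper-bound end-to-end length (Å) of a spacer of n residues."""
    return n_spacer_residues * rise


def can_span(n_spacer_residues: int, gap: float = GAP_A) -> bool:
    """True if a fully extended spacer could, in principle, bridge the gap."""
    return max_reach(n_spacer_residues) >= gap


if __name__ == "__main__":
    for n in (13, 9, 4):
        print(f"{n:2d}-residue spacer: reach <= {max_reach(n):.1f} Å, "
              f"spans {GAP_A:.0f} Å gap: {can_span(n)}")
```

Under this assumption the 13- and 9-residue spacers can bridge the 22 Å gap and the 4-residue spacer cannot, which matches the direction of the reported kinetics without saying anything about the magnitude of the effect.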


NEDD8 coordinates ubiquitin ligation assembly
NEDD8 is covalently linked to the WHB domain from CUL1, and
together they form a globular activation module. A NEDD8 groove,
comprising the Ile36/Leu71/Leu73 hydrophobic patch and the C-terminal
tail, embraces the hydrophobic face of the isopeptide-bound
CUL1 helix (Figs. 2d, 3a, b). At the centre, Gln40 of NEDD8 contacts
CUL1, the isopeptide bond, and the C-terminal tail of NEDD8 in a
buried polar interaction that is typical of such organizing apolar
interfaces. This rationalizes how pathogenic bacterial effectors
that catalyse Gln40 deamidation impair CRL1-dependent ubiquitylation:
the resultant negative charge would destroy the CUL1–NEDD8
interface.

The activation module binds the catalytic module, which explains
how neddylation helps CRL1^β-TRCP to recruit UBE2D in cells. The
Ile44 hydrophobic patch of NEDD8 engages the ‘backside’ of UBE2D,
Fig. 2 | Cryo-EM structure representing neddylated CRL1^β-TRCP-mediated
ubiquitin transfer from UBE2D to the substrate IκBα. a, Cryo-EM density
representing the neddylated CRL1^β-TRCP–UBE2D~Ub–IκBα substrate
intermediate, in which UBE2D~Ub is activated and juxtaposed with the
substrate. b, The substrate-scaffolding module connects β-TRCP-bound


which is opposite the active site for ubiquitylation (Fig. 3c, Extended
Data Fig. 7a). Contacts resemble those described for free NEDD8
or ubiquitin binding to the backside of UBE2D and allosterically
stimulating the intrinsic reactivity of an isolated RING–UBE2D~Ub
subcomplex. We examined intrinsic reactivity by monitoring
ubiquitin discharge to free lysine using a previously described hyperactive
neddylated CRL1^β-TRCP mutant and high concentrations of
enzyme and lysine. Substrate-independent ubiquitin transferase
activity was impaired by a NEDD8 mutant that disrupts the integrity
of the activation module, and by mutations of UBE2D that hinder its
interactions with the covalently linked ubiquitin, the RING domain of
RBX1 or NEDD8 (Extended Data Fig. 7c–f). The architecture observed
in the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate complex structure
may therefore both stimulate the intrinsic reactivity of the
UBE2D~Ub intermediate and place the catalytic centre in proximity
to the β-TRCP-bound substrate.

The activation module is itself positioned by the binding of NEDD8 
to the substrate-scaffolding module. Leu2, Lys4, Glu14, Asp16, Arg25,
Arg29, Glu32, Gly63 and Gly64 of NEDD8, which nestle in a concave
CUL1 surface, differ in ubiquitin, and these amino acids account for
nearly one-third of the differences in sequence between the two proteins
(Fig. 3d). The counterparts of these amino acids in ubiquitin would
be expected to repel CUL1, which rationalizes the need for NEDD8 as a
distinctive ubiquitin-like protein.

In addition, the catalytic module contacts both CUL1–RBX1 and the
substrate receptor sides of the substrate-scaffolding module (Fig. 3e,
Extended Data Fig. 5). On one side, the RING domain of RBX1 stacks on
the Trp35 side chain of its C/R domain, which is consistent with previously
reported effects of introducing a W35A mutation into RBX1. On
the other side, the curved β-sheet of UBE2D complements the propeller
of β-TRCP (Fig. 3e).


NEDD8 conformation coincidence coupling 

Different conformations of ubiquitin and ubiquitin-like proteins have 
long been known to influence their interactions, with ubiquitin-binding 
domains selecting between ‘loop-in’ or ‘loop-out’ orientations of the 




substrate to the intermolecular cullin–RBX (C/R) domain. c, The catalytic
module consists of RING–UBE2D~Ub of RBX1 in the canonical closed activated
conformation, and additional density corresponding to the chemical surrogate
for the substrate undergoing ubiquitylation. d, NEDD8 and the covalently
linked WHB domain of CUL1 form the activation module.


Leu8-containing β1/β2-loop. However, it remains largely unknown
how these conformations might simultaneously affect binding to
multiple partners. Our data show that NEDD8 must adopt the loop-out
conformation both to form the activation module and to engage
UBE2D in the catalytic module (Fig. 3f, Extended Data Fig. 7g–i). The
conformation of NEDD8 apparently serves as a coincidence detector,
coupling noncovalent binding to the linked WHB domain of CUL1 and
to the catalytic module.


Synergistic catalytic assembly 


We next assessed the importance of the structurally observed interfaces 
by testing the effects of mutations in relevant regions of the proteins. 
We used an attenuated pulse-chase assay format that qualitatively, but
exclusively, monitors NEDD8-activated substrate priming by CRL1^β-TRCP
and UBE2D (Extended Data Fig. 8). Mutations that were designed to
destroy the activation module (NEDD8(Q40E)) or to hinder interactions
between the activation and catalytic modules (NEDD8(I44A) or
UBE2D(S22R)) substantially impaired substrate priming, as did swapping
key NEDD8 residues at the interface with the substrate-scaffolding
module with those found in ubiquitin. Moreover, although the structural
basis for ubiquitylation by other CRLs requires further investigation,
these mutations also impair the UBE2D-mediated priming of a
cyclin E phosphopeptide substrate with neddylated CRL1^FBW7, and of
IKZF zinc finger 2 by neddylated CRL4^CRBN/pomalidomide (Extended Data
Fig. 8g–o).

Given that the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate intermediate
depends on several conformational changes and large interfaces
within and between modules (Fig. 2, Extended Data Figs. 5, 7),
we hypothesized that pairing mutations that affect various interfaces
would have synergistic effects. We used two types of experiment to
define kinetic parameters for peptide substrate ubiquitylation: the
Michaelis constant, Km, was measured by titrating UBE2D, and the
rate constant, kobs, for ubiquitin transfer was determined by rapid
quench-flow at saturating UBE2D concentrations to isolate catalytic
defects. We drew three main conclusions from the results (Fig. 4a,
Extended Data Fig. 1, Extended Data Table 1). First, although no
Fig. 3 | Intra- and inter-module interfaces specifying the catalytic
architecture for ubiquitin priming of the substrate by neddylated CRL1^β-TRCP
with UBE2D. a, Cryo-EM density highlighting noncovalent interfaces that
contribute to the catalytic architecture for neddylated CRL1^β-TRCP-mediated
ubiquitin transfer from UBE2D to a substrate. Circled regions correspond to
interfaces within the activation module, and between activation and catalytic,
activation and substrate-scaffolding, and catalytic and substrate-scaffolding
modules shown in b–f. b, Close-up view of the intra-activation module
interface, showing the buried polar residue Gln40 of NEDD8 and the Ile36/
Leu71/Leu73 hydrophobic patch making noncovalent interactions with the
WHB domain of CUL1, adjacent to the isopeptide bond linking NEDD8 and
CUL1. c, Close-up view of the interface between the activation and catalytic
modules, showing key residues at the interface between NEDD8 and the
backside of UBE2D. d, Close-up view highlighting (in orange) the residues of
NEDD8 that differ in ubiquitin, and that are at the interface with the substrate-scaffolding
module. e, Close-up view highlighting His32 of UBE2D at the
interface with the substrate-scaffolding module. f, Close-up view indicating
the role of the loop-out conformation of NEDD8, which is required for the
binding of UBE2D.


individual mutation is as detrimental as eliminating neddylation,
the mutation of each interface on its own shows a decrease in kobs/Km
of approximately tenfold or more compared with the wild-type. The
most detrimental ‘single’ mutation is the substitution of NEDD8 with
Ub(R72A); this confirms the importance of NEDD8 and of interactions
between the activation and substrate-scaffolding modules. The
next most detrimental mutation is Q40E in NEDD8; this underscores
the importance of the structure of the activation module and its
role in establishing the loop-out conformation of NEDD8. Second,
combining mutations at several sites has a devastating effect on
activity, even if the effect of the individual mutations alone is mild.
For example, when using NEDD8(I44A) the activity is reduced by
tenfold compared with the wild-type; however, when this mutant
is combined with UBE2D(H32A) at the catalytic module–substrate
receptor interface, which is mildly defective in an attenuated assay
with subsaturating UBE2D (Extended Data Fig. 8f), a near 200-fold
reduction in activity is observed. Third, the mutations have comparatively
little effect on chain elongation, which is consistent with
the structure of the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate
intermediate defining ubiquitin transfer from UBE2D directly to the
unmodified substrate.
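
To put the fold changes quoted above on a common scale, the following sketch converts a fold reduction in catalytic efficiency into an apparent free-energy penalty, ΔΔG = RT ln(fold). This conversion is a standard transition-state-theory approximation and was not part of the analysis reported here; the function names are hypothetical, and the temperature is taken as 22 °C to match the reaction conditions given in Methods.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal mol^-1 K^-1
T_KELVIN = 295.15      # 22 °C, the reaction temperature stated in Methods


def fold_to_ddg(fold_reduction: float, temperature: float = T_KELVIN) -> float:
    """Apparent free-energy penalty (kcal/mol) for a fold reduction in kobs/Km.

    Assumes the simple relation ddG = R * T * ln(fold); this is an
    interpretive approximation, not a quantity reported in the study.
    """
    return R_KCAL * temperature * math.log(fold_reduction)


def implied_partner_fold(combined_fold: float, single_fold: float) -> float:
    """Defect a second mutation would need on its own if effects simply multiplied."""
    return combined_fold / single_fold


if __name__ == "__main__":
    for label, fold in (("NEDD8(I44A)", 10.0),
                        ("NEDD8(I44A) + UBE2D(H32A)", 200.0),
                        ("no neddylation", 2000.0)):
        print(f"{label}: {fold:g}-fold, about {fold_to_ddg(fold):.2f} kcal/mol")
    # If the two defects were independent, H32A alone would have to cost:
    print(implied_partner_fold(200.0, 10.0), "fold")
```

On this scale a tenfold defect corresponds to roughly 1.35 kcal/mol, and simple multiplicativity would require UBE2D(H32A) alone to cost about 20-fold, which is more than the mild defect described in the text and is one way to read the reported synergy.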
Discussion


Our cryo-EM structure representing the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate
intermediate suggests a model for substrate
priming that addresses many longstanding questions. First, rapid
substrate ubiquitylation can be explained by NEDD8, the cullin and
RBX1-bound UBE2D~Ub making numerous interactions that activate
UBE2D and synergistically place the catalytic centre adjacent to β-TRCP.
Second, biochemical features of neddylated CRLs that were incompatible
with previous structures are now rationalized, including NEDD8-stimulated
crosslinking between a CRL1^β-TRCP-bound phosphopeptide
and UBE2D; simultaneous NEDD8 linkage to a cullin and binding to
the backside of RBX1-bound UBE2D; and the detrimental effects
of bacterial-effector-catalysed NEDD8 Gln40 deamidation (Figs. 2,
3). Third, residues in ubiquitin that differ from those in NEDD8 would
clash in the catalytic architecture (Fig. 3d, Extended Data Table 1), thus
rationalizing the existence of NEDD8 as a distinct ubiquitin-like protein.

In the absence of other factors, the scaffolding module of neddylated
CRL1^β-TRCP robustly bridges the substrate with the C/R domain, whereas
NEDD8, its linked CUL1 WHB domain and the RING domain of RBX1
are relatively dynamic and apparently mobile (Extended Data Fig. 2).
These mobile entities are harnessed in the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate
intermediate (Fig. 2). The numerous requisite
protein-protein interactions and conformational changes suggest 
that there could be several routes to the catalytic architecture (Fig. 4b), 
in which the formation of interfaces successively narrows the range 
of options—akin to progression down a free-energy funnel. Because 
ubiquitylation does occur with mutant substrates or enzymes, albeit 
at substantially lower rates (Extended Data Table 1), we cannot exclude 
that ubiquitin could be transferred from RING- and NEDD8-bound 
UBE2D in various orientations relative to the substrate-scaffolding 
module. However, if the thioester bond is both in the RING-activated 
configuration and adjacent to the substrate—as in the structure—this 
would increase the rate at which the presumably random exploration of 
three-dimensional space by a substrate lysine would lead to productive 
collision with the active site. Accordingly, reducing any single contri- 
bution to the structurally observed catalytic architecture increases 
the relative importance of other contacts—even lesser ones (Fig. 4a, 
Extended Data Table 1). 

Whereas neddylated CRL1^β-TRCP and UBE2D seem to be optimal for
ubiquitin priming of peptide-like substrates, the limited effect of muta- 
tions on the linkage of subsequent ubiquitins (Extended Data Table 1) 
raises the possibility that different forms of ubiquitylation involve 
alternative—currently unknown—catalytic architectures. Although 
not overtly observed for our substrates with a single acceptor lysine, 
an outstanding question is whether there are other circumstances in 
which a substrate-linked ubiquitin could mimic NEDD8 and activate 
further ubiquitylation. Moreover, in addition to UBE2D, neddylated
CRLs recruit a range of other ubiquitin-carrying enzymes (from ARIH-family
RBR E3s for substrate priming to other E2s for polyubiquitylation)
and UBXD7, which in turn recruits the AAA-ATPase p97 to
process some ubiquitylated substrates. We speculate that these CRL
partners uniquely harness the dynamic NEDD8, its linked CUL WHB
domain and/or the RING domain of RBX1 to specify distinct catalytic
activities, much like the thioester-linked UBE2D~Ub intermediate
captures neddylated CRL1^β-TRCP through multiple surfaces to specify
substrate priming. The malleability of neddylated CRLs, coupled with
numerous ubiquitin-carrying enzyme partners, may underlie successful
molecular glue or PROTAC-targeted protein degradation, whereas
potential limitations in their ability to achieve the optimal multivalent
catalytic architectures may explain failures in such chemically
directed ubiquitylation of heterologous substrates. Additionally, it
seems likely that the conformational dynamics of unneddylated CRLs 
(Extended Data Fig. 2) would enable transitioning between the differ- 
ent conformations coordinating cycles of neddylation-deneddylation 
Fig. 4 | Multifarious interactions configuring rapid substrate priming.
a, Effects of the indicated mutants (within the activation module, between
activation and catalytic, activation and substrate-scaffolding and substrate-scaffolding
and catalytic modules, alone or in combination) on the catalytic
efficiency of substrate priming, as quantified by overall fold difference in
kobs/Km compared with wild-type neddylated CRL1^β-TRCP and UBE2D-catalysed
ubiquitylation of a peptide substrate. Reactions with unneddylated CRL1^β-TRCP
serve as a reference, and used CUL1(K720R) to prevent obscuring the
interpretation of results by artefactual ubiquitin transfer to CUL1 and the
resultant artefactual activation of substrate priming. Graphs show the average
value from two different experiments (technical replicates), for which curve
fits and values are provided in Extended Data Fig. 1 and Extended Data Table 1.
b, On their own, neddylated CRL1^β-TRCP and UBE2D~Ub are dynamic, and at an


with CAND1-driven substrate-receptor exchange. Thus, the multifarious
nature of interactions and conformations that determine
robust and rapid substrate priming, as revealed by the structure of
the neddylated CRL1^β-TRCP–UBE2D~Ub–substrate intermediate, also
provides a mechanism by which common elements can be transformed 
by different protein partners to interconvert between distinct CRL 
assemblies to meet the cellular demand for ubiquitylation. 
Methods


Cloning, protein expression and purification 
All proteins are of human origin. All variants of UBE2D2, UBE2D3,
UBE2M, RBX1, CUL1, CUL4, NEDD8 and ubiquitin were generated using
PCR, QuikChange (Agilent), or were synthesized by Twist Biosciences.
UBE2D2 was purified as previously described, and UBE2D3 was
purified in a similar manner. Ubiquitin was expressed in BL21(DE3) RIL
as previously described. Wild-type CUL1, RBX1(5-C), SKP1, β-TRCP2,
CUL4A (from residue 38 to C terminus, hereafter referred to as CUL4),
CRBN, DDB1 and UBA1 were cloned into pLIB vectors. GST-TEV-RBX1
and CUL1, GST-TEV-RBX1 and CUL4A, His-TEV-β-TRCP2 and SKP1 or
His-TEV-DDB1 and GST-TEV-CRBN were co-expressed by co-infecting
with two baculoviruses. UBA1 was cloned with an N-terminal GST-tag
with a TEV cleavage site. These proteins were expressed in Trichoplusia
ni High-Five insect cells and purified by either GST or nickel-affinity
chromatography and overnight TEV cleavage, followed by ion-exchange
and size-exclusion chromatography. All variants of CUL1–RBX1 were
purified similarly. Purification of NEDD8, UBE2M, APPBP1–UBA3, SKP1–FBW7
(from residue 263 to C terminus), neddylation of CUL1–RBX1,
and fluorescent labelling of ubiquitin used for biochemical assays were
performed as previously described. β-TRCP1 (monomeric form, from
residue 175 to C terminus, hereafter referred to as β-TRCP1) with an
N-terminal His-MBP followed by a TEV cleavage site was cloned into
a pRSFDuet vector with SKP1ΔΔ (SKP1 with two internal deletions,
of residues 38–43 and 71–82). SKP1ΔΔ–β-TRCP1 was expressed in
BL21(DE3) Gold E. coli at 18 °C and purified by nickel-affinity chromatography,
followed by TEV cleavage, anion exchange and size-exclusion
chromatography. Modification of RBX1–CUL1 and RBX1–CUL4A by
ubiquitin instead of NEDD8 was performed with the Ub(R72A) mutant
that allows its activation and conjugation by neddylating enzymes
APPBP1–UBA3 and UBE2M. The reaction for ubiquitylating CUL4A–RBX1
was performed at pH 8.8 to drive the reaction to completion.
The previously described Y130L mutant of UBE2M was used to modify
CUL1–RBX1 with the I44A mutant of NEDD8. IKZF1 ZF2 (residues 141–169,
with two point mutations (K157R/K165R) and with a lysine added
at position 140 to create a single target lysine at the N terminus) was
cloned with an N-terminal GST with a 3C-PreScission cleavage site and
a noncleavable C-terminal Strep-tag. IKZF1 ZF2 was purified by GST
affinity chromatography, 3C-PreScission cleavage overnight, and size-exclusion
chromatography. UBE4B RING-like U-box domain (residues
1200–C terminus) containing D1268T and N1271T point mutations
that enhance activity (hereafter referred to as UBE4B) was cloned
with an N-terminal GST with TEV cleavage site. UBE4B was purified by
GST affinity chromatography, TEV cleavage overnight, followed by ion
exchange and size-exclusion chromatography.


Peptides 
All peptides were stated to be of >95% purity by HPLC and were used 
as received. 

Peptides used to quantify enzyme kinetics had the following
sequences: IκBα, KERLLDDRHD(pS)GLD(pS)MRDEERRASY (obtained
from New England Peptide); β-catenin short, KSYLD(pS)GIH(pS)
GATTAPRRASY (obtained from Max Planck Institute of Biochemistry
Core Facility); β-catenin medium, KAWQQQSYLD(pS)GIH(pS)
GATTTAPRRASY (obtained from New England Peptide); β-catenin
long, KAAVSHWQQQSYLD(pS)GIH(pS)GATTAPRRASY (obtained
from Max Planck Institute of Biochemistry Core Facility); β-catenin
for sortase-mediated transpeptidation to ubiquitin to generate a
homogeneously ubiquitin-linked substrate, GGGGYLD(pS)GIH(pS)
GATTAPRRASY (obtained from Max Planck Institute of Biochemistry
Core Facility).

Peptides used for the qualitative assays monitoring substrate priming
(that is, fluorescent ubiquitin transfer from UBE2D~Ub to substrate)
had the following sequences: IκBα, KKERLLDDRHD(pS)GLD(pS)MKDEE
(as previously described); CyE, KAMLSEQNRASPLPSGLL(pT)
PPQ(pS)GRRASY (as previously described).

The nonmodifiable substrate analogue used in the competition
experiment in Extended Data Fig. 3d had the following sequence: IκBα,
RRERLLDDRHD(pS)GLD(pS)MRDEE (obtained from Max Planck Institute
of Biochemistry Core Facility).

Peptides used in cryo-EM experiments are as follows: for the structure
representing neddylated CRL1^β-TRCP–UBE2D~Ub–IκBα substrate
described in detail (and the cryo-EM experiments shown in Extended Data
Fig. 3e, f, h, i): IκBα, CKKERLLDDRHD(pS)GLD(pS)MKDEEDYKDDDDK
(obtained from Max Planck Institute of Biochemistry Core Facility);
for cryo-EM reconstructions of unneddylated and neddylated
CRL1^β-TRCP–IκBα substrate shown in Extended Data Fig. 2a, b: IκBα,
KKERLLDDRHD(pS)GLD(pS)MKDEE (as previously described).


Enzyme kinetics 

UBE2D3 titrations under substrate single-encounter conditions 
for estimation of the K_m for E2 used by neddylated CRL1. These
experiments used full-length CRL1^β-TRCP2 and UBE2D3, referred to here
as CRL1^β-TRCP and UBE2D. Fifty micromolar peptide substrate (for a list
of peptides that were used in the assay, see ‘Peptides’) was
radiolabelled with 5 kU of cAMP-dependent protein kinase (New England
Biolabs) in the presence of [γ-32P]ATP for 1 h at 30 °C. Two mixtures
were prepared before initiation of the reaction: a UBA1/Ub mix
containing unlabelled substrate competitor peptide that was identical in
sequence to the labelled one (the one exception being the Ub-β-catenin
substrate, in which the unlabelled β-catenin for sortase peptide was
used); and a neddylated or unneddylated (with the CUL1(K720R) mutant, to
prevent low-level ubiquitylation by UBE2D) CRL1^β-TRCP/labelled peptide
substrate mix. The UBA1/Ub mix contained reaction buffer composed of
30 mM Tris-HCl, 100 mM NaCl, 5 mM MgCl₂, 2 mM ATP and 2 mM DTT pH 7.5.
The concentration of ubiquitin was 80 µM, with 1 µM UBA1 and 100 µM
unlabelled peptide. UBE2D was first prepared as a twofold dilution
series from a variable stock concentration, then introduced individually
into tubes containing equal amounts of the UBA1/Ub mix. The
CRL1^β-TRCP/labelled peptide substrate mix contained the same reaction
buffer as the UBA1/Ub/UBE2D mix, 0.5 µM CRL1^β-TRCP, and 0.2 µM labelled
peptide substrate. The reactions were initiated at 22 °C by combining
equal volumes of both mixes, rapidly vortexed and quenched after 10 s in
2x SDS-PAGE buffer containing 100 mM Tris-HCl, 20% glycerol, 30 mM
EDTA, 4% SDS and 4% β-mercaptoethanol pH 6.8. Each titration series
was performed in duplicate and resolved on hand-cast, reducing 18%
SDS-PAGE gels. The gels were imaged on a Typhoon 9410 Imager and
quantification of substrate and products was performed using ImageQuant
(GE Healthcare). The product in each lane was measured as the signal of
the ubiquitylated products divided by the total signal, plotted against
the UBE2D concentration, and fit to the Michaelis-Menten equation to
estimate K_m (GraphPad Prism software). The standard error was
calculated using Prism and has been provided in Extended Data Table 1
for all estimates of K_m.
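
The quantification and fit described above can be illustrated with a minimal sketch. It is not the Prism procedure used in the study: it assumes the standard hyperbolic form, fraction = V_max × [UBE2D] / (K_m + [UBE2D]), and fits it by a plain least-squares search. The function names and the synthetic titration values are hypothetical and serve only to show the calculation.

```python
import math

def product_fraction(product_signals, substrate_signal):
    """Ubiquitylated-product signal divided by the total lane signal."""
    products = sum(product_signals)
    return products / (products + substrate_signal)

def michaelis_menten(conc, vmax, km):
    """Hyperbolic saturation curve: vmax * conc / (km + conc)."""
    return vmax * conc / (km + conc)

def _profile(km, concs, fracs):
    """For a fixed km, return (sum of squared errors, best-fit vmax)."""
    sat = [c / (km + c) for c in concs]
    vmax = sum(y * s for y, s in zip(fracs, sat)) / sum(s * s for s in sat)
    sse = sum((y - vmax * s) ** 2 for y, s in zip(fracs, sat))
    return sse, vmax

def fit_michaelis_menten(concs, fracs, km_min=1e-3, km_max=1e3, grid=400):
    """Least-squares fit: coarse log-spaced scan over km, then golden-section refinement."""
    lo, hi = math.log(km_min), math.log(km_max)
    logs = [lo + i * (hi - lo) / (grid - 1) for i in range(grid)]
    sses = [_profile(math.exp(x), concs, fracs)[0] for x in logs]
    best = min(range(grid), key=sses.__getitem__)
    a, b = logs[max(best - 1, 0)], logs[min(best + 1, grid - 1)]
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(100):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if _profile(math.exp(c), concs, fracs)[0] < _profile(math.exp(d), concs, fracs)[0]:
            b = d
        else:
            a = c
    km = math.exp((a + b) / 2)
    return km, _profile(km, concs, fracs)[1]

# Hypothetical twofold dilution series (µM) with noise-free synthetic fractions.
concs = [0.05 * 2 ** i for i in range(8)]
fracs = [michaelis_menten(c, 0.6, 0.35) for c in concs]
km_fit, vmax_fit = fit_michaelis_menten(concs, fracs)
```

With real gel data, `fracs` would be built lane by lane with `product_fraction`, and duplicate titrations would simply contribute additional points to the same fit.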


Estimating the rates of ubiquitin transfer to CRL1-bound substrate 
using pre-steady-state kinetics. These experiments used full-length
CRL1^β-TRCP2 and UBE2D3, referred to here as CRL1^β-TRCP and UBE2D,
respectively. Separate UBA1/UBE2D/Ub and CRL1^β-TRCP/labelled peptide
substrate mixes were prepared to assemble single-encounter
ubiquitylation reactions. For most reactions, the UBA1/UBE2D3/Ub mix
contained reaction buffer, 80 µM Ub, 1 µM UBA1, 40 µM UBE2D, and
200 µM unlabelled competitor substrate peptide. For all reactions
containing either CUL1(K720R) or UBE2D(H32A) assayed with wild-type
neddylated CUL1, CUL1 modified either with the mutant NEDD8(I44A)
or with Ub(R72A) permitting ligation to CUL1, 120 µM Ub and 70 µM
UBE2D were used. The CRL1^β-TRCP/labelled peptide substrate mixes
contained reaction buffer, 0.5 µM CRL1^β-TRCP, and 0.2 µM labelled
peptide (for a list of peptides that were used in the assay, see
‘Peptides’). Each mix was separately loaded into the left or right
sample loops on a KinTek RQF-3 quench flow instrument, and successive
time points were taken at 22 °C by combining the mixtures with drive
buffer composed of 30 mM Tris-HCl and 100 mM NaCl pH 7.5. Reactions were
quenched at various time points in 2x SDS-PAGE buffer to generate the
time courses. Substrate and products from each time point were resolved
on hand-cast, reducing 18% SDS-PAGE gels. The gels were imaged on
a Typhoon 9410 Imager, and substrate and product bands were
individually quantified as a percentage of the total signal for each
time point using ImageQuant (GE Healthcare). Reactions were performed
in duplicate, and the average of each substrate or product band was
used for the analysis. The data for substrate (S0) or mono-ubiquitylated
product (S1) bands were fit to their respective closed-form solutions as
previously described using Mathematica to obtain the values
k_obs(S0–S1) and k_obs(S1–S2) (Extended Data Table 1). The standard
errors were calculated in Mathematica and have been provided in Extended
Data Table 1 for all estimates of k_obs.
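
The closed-form solutions themselves are given in the cited earlier work and are not reproduced in this section. As an illustration only, the sketch below assumes the textbook irreversible sequential scheme S0 → S1 → S2 with first-order rate constants k1 and k2, and it ignores substrate dissociation, which the published equations may treat differently. The function names are hypothetical; the example rates are the values of 8.9 s⁻¹ and 0.25 s⁻¹ quoted for the medium β-catenin peptide later in the Methods.

```python
import math

def s0_fraction(t, k1):
    """Fraction of unmodified substrate remaining at time t (s)."""
    return math.exp(-k1 * t)

def s1_fraction(t, k1, k2):
    """Fraction of mono-ubiquitylated substrate at time t (s)."""
    if math.isclose(k1, k2):
        # Limiting form when both steps have the same rate constant.
        return k1 * t * math.exp(-k1 * t)
    return k1 / (k2 - k1) * (math.exp(-k1 * t) - math.exp(-k2 * t))

def s1_peak_time(k1, k2):
    """Time at which S1 is maximal in this simplified scheme (requires k1 != k2)."""
    return math.log(k1 / k2) / (k1 - k2)

k1, k2 = 8.9, 0.25          # s^-1, rates quoted in the text for S0-S1 and S1-S2
t_peak = s1_peak_time(k1, k2)
time_course = [(t, s0_fraction(t, k1), s1_fraction(t, k1, k2))
               for t in (0.0, 0.05, 0.1, 0.2, 0.5, 1.0)]
```

In a fit, k1 and k2 would be adjusted until `s0_fraction` and `s1_fraction` match the quantified S0 and S1 bands across the quench-flow time points.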


Multiturnover assays with short β-catenin. The multiturnover assay
showing the ubiquitin transfer to short β-catenin (Extended Data
Fig. 6c, d) was performed as described in ‘Estimating the rates of
ubiquitin transfer to CRL1-bound substrate using pre-steady-state
kinetics’, but without the excess unlabelled β-catenin peptide
substrate. Time points were collected by quenching in 2x SDS-PAGE
loading buffer. Substrate and product were separated by SDS-PAGE
followed by autoradiography. The fraction of unmodified substrate (S0)
was quantified and fit to either a one-phase decay (unneddylated) or
linear (neddylated) model (Prism 8). Similarly, products containing five
or more ubiquitins were quantified and fit to either an exponential
growth (neddylated) or linear (unneddylated) model. Experiments
were performed in duplicate.


Generation of ubiquitylated β-catenin fusion via sortase reaction.
A mimic of ubiquitylated β-catenin was generated by fusing a ubiquitin
with a C-terminal LPETGG with a GGGG-β-catenin peptide. The reaction
was incubated with concentrations of 50 µM Ub-LPETGG, 300 µM
GGGG-β-catenin peptide, and 10 µM 6xHis-Sortase A for 10 min in 50 mM
Tris, 150 mM NaCl, 10 mM CaCl₂ pH 8.0. Sortase A was removed by
retention on nickel resin, and the product was further purified by
size-exclusion chromatography in 25 mM HEPES, 150 mM NaCl, 1 mM DTT at
pH 7.5. The sequence of ubiquitin with sortase motif was as follows:
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGSGSGSLPETGG.


Other biochemical assays 

For experiments comparing the activity of neddylated with unned- 
dylated CRLs or CUL-RBX1 complexes, in the unneddylated versions 
the NEDD8 modification sites of CUL1 and CUL4 were mutated to Arg 
(CUL1(K720R) and CUL4A(K705R)) to prevent obscuring interpretation 
of the results by low-level ubiquitylation of the NEDD8 consensus Lys 
during the ubiquitylation reactions’. 


Substrate priming assays. Experiments in Extended Data Fig. 8a-f
used full-length CRL1^β-TRCP2 and UBE2D3. Ubiquitylation of IκBα by
CRL1^β-TRCP via UBE2D3 was monitored using a pulse-chase format that
specifically detects CRL1^β-TRCP-dependent ubiquitin modification from
UBE2D to IκBα independently of effects on UBA1-dependent formation
of the UBE2D3~Ub intermediate. The pulse reaction generated a
thioester-linked UBE2D~Ub intermediate and contained 10 µM UBE2D,
15 µM fluorescent ubiquitin, 0.2 µM UBA1 in 50 mM Tris, 50 mM NaCl,
2.5 mM MgCl₂, 1.5 mM ATP pH 7.5 incubated at room temperature for
10 min. The pulse reaction was quenched with 25 mM EDTA on ice for
5 min, then further diluted to 100 nM UBE2D in 25 mM MES, 150 mM
NaCl pH 6.5 for subsequent mixture with components of the reaction
for neddylated CRL1^β-TRCP-dependent ubiquitin transfer to the
substrate in the chase reaction. The chase reaction mix consisted of
400 nM CRL (NEDD8-CUL1-RBX1-SKP1-β-TRCP), and 1 µM substrate
(phosphorylated peptide derived from IκBα) in 25 mM MES, 150 mM
NaCl pH 6.5 incubated on ice. After the quench, the pulse reaction
mix was combined with the chase reaction mix at a 1:1 ratio on ice. The
final reaction concentrations were 50 nM UBE2D (in thioester-linked
UBE2D~Ub complex) and 200 nM neddylated CRL1^β-TRCP to catalyse
substrate ubiquitylation. Samples were taken at each time point and
quenched with 2x SDS-PAGE sample buffer; protein components were
separated by nonreducing SDS-PAGE, and the gel was scanned on an
Amersham Typhoon imager (GE Healthcare).
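
The final concentrations quoted above follow directly from the 1:1 mixing of the diluted pulse reaction with the chase mix. The short sketch below reproduces that arithmetic; the helper name and dictionary keys are hypothetical, and the input values are those given in the text.

```python
def mix(pulse, chase, vol_pulse=1.0, vol_chase=1.0):
    """Final concentrations after combining two solutions (same units in and out)."""
    total = vol_pulse + vol_chase
    final = {}
    for name, conc in pulse.items():
        final[name] = final.get(name, 0.0) + conc * vol_pulse / total
    for name, conc in chase.items():
        final[name] = final.get(name, 0.0) + conc * vol_chase / total
    return final

# Concentrations in nM, as stated for the diluted pulse and the chase mix.
pulse_mix = {"UBE2D~Ub": 100.0}
chase_mix = {"neddylated CRL1": 400.0, "IkBa peptide": 1000.0}
final = mix(pulse_mix, chase_mix)
# final["UBE2D~Ub"] is 50.0 and final["neddylated CRL1"] is 200.0
```

The same calculation gives 500 nM substrate, the value reported for the competition assays that use an identical 1:1 mixing step.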

Substrate priming reactions assaying the effects of variations in
UBE2D shown in Extended Data Fig. 8g-i on CRL1^FBW7-dependent
ubiquitylation of a phosphopeptide derived from CyE were performed
similarly to those for CRL1^β-TRCP as described above with 100 nM
UBE2D~Ub (based on concentration of UBE2D from the pulse reaction),
500 nM neddylated CUL1-RBX1-SKP1-FBW7 (residues 263 to the C terminus),
and 2.5 µM CyE phosphopeptide in 25 mM HEPES, 150 mM NaCl pH 7.5
at room temperature. Experiments testing the effects of variations in
NEDD8 (or its substitution with Ub(R72A)) were performed in the same
manner, except with 250 nM NEDD8 (or variant)-modified
CUL1-RBX1-SKP1-FBW7 (residues 263 to the C terminus).

Our assay for CRL4^CRBN ubiquitylation of IKZF was established on
the basis of findings that ZF2 mediates tight
immunomodulatory-drug-dependent interactions sufficient to target
degradation, and UBE2D3 contributes to the stability of CRL4^CRBN
neomorphic substrates in cells. Substrate priming reactions showing
controls and assaying effects of variations in NEDD8 and UBE2D3 shown in
Extended Data Fig. 8j-o monitored ubiquitylation of IKZF ZF2 with
400 nM UBE2D~Ub (concentration determined by that of UBE2D in the chase
reaction), 500 nM NEDD8-CUL4-RBX1-DDB1-CRBN, 5 µM pomalidomide and
2.5 µM IKZF ZF2 in 25 mM HEPES, 150 mM NaCl pH 7.5 at room temperature.
Effects of swapping NEDD8 for Ub(R72A) on the CRL4^CRBN-mediated
ubiquitylation of IKZF ZF2 are shown in Extended Data Fig. 8m and
were performed similarly but with 100 nM UBE2D~Ub, 250 nM NEDD8-
or Ub-modified CUL4-RBX1-DDB1-CRBN, 2.5 µM pomalidomide and
1.25 µM IKZF ZF2.


Assays for intrinsic activation of UBE2D~Ub intermediate. Assays
shown in Extended Data Figs. 3b, g, 7d-f, which monitor neddylated
CUL1-RBX1 activation of the thioester-linked UBE2D~Ub intermediate
(that is, in the absence of substrate), used the RBX1(N98R) variant
that is hyperactive towards UBE2D~Ub. Experiments in Extended
Data Figs. 3b, g, 7f were performed in pulse-chase format similar to
substrate-priming assays, but with 9 µM UBE2D~Ub (loading reaction
with 20 µM UBE2D, 30 µM Ub and 0.5 µM UBA1), 500 nM E3 and 5 mM
free lysine. For unneddylated CUL1-RBX1- or UBE4B-dependent
discharge, 50 mM free lysine was used instead. Discharge assays shown in
Extended Data Fig. 7d, e used 5 µM UBE2D~Ub (loading reaction with
20 µM UBE2D, 20 µM Ub and 0.5 µM UBA1), 500 nM E3 and 10 mM free
lysine for neddylated CUL1-RBX1-dependent discharge, and 50 mM
free lysine for unneddylated CUL1-RBX1-dependent discharge. All
assays were visualized by Coomassie-stained SDS-PAGE.


Generation of a stable proxy for the UBE2D~Ub-substrate
intermediate

Preparation of His-TEV-Ub(1-75)-MESNa. His-TEV-Ub(1-75) was
cloned using a previously described method into pTXB1 (New England
Biolabs) and transformed into BL21(DE3) RIL. Cells were grown in
terrific broth at 37 °C to an optical density at 600 nm (OD₆₀₀) of 0.8
and then induced with IPTG (0.5 mM), shaking overnight at 16 °C. The
collected cells were resuspended (20 mM HEPES, 50 mM NaOAc, 100 mM
NaCl, 2.5 mM PMSF pH 6.8), sonicated and then centrifuged (50,000g,
4 °C, 30 min). Ni-NTA resin (1 millilitre resin per litre of broth,
Sigma Aldrich) was equilibrated with the resuspension buffer and
incubated with the cleared lysate at 4 °C on a roller (30 rpm) for 1 h.
The resin was then transferred to a gravity column and washed (5 x 1
column volume with 20 mM HEPES, 50 mM NaOAc, 100 mM NaCl pH 6.8).
Protein was then eluted (5 x 1 column volume with 20 mM HEPES, 50 mM
NaOAc, 100 mM NaCl, 300 mM imidazole pH 6.8). Ubiquitin was then cleaved
from the chitin-binding domain by diluting the eluted protein 10:1 (v/v)
with 20 mM HEPES, 50 mM NaOAc, 100 mM NaCl, 100 mM sodium
2-mercaptoethanesulfonate (Sigma Aldrich) pH 6.8. This solution was
incubated at room temperature overnight on a roller (30 rpm). Ub-MESNa
was finally purified by size-exclusion chromatography (SD75 HiLoad,
GE Healthcare) equilibrated with 12.5 mM HEPES, 25 mM NaCl pH 6.5.

Sequence of His-TEV-Ub(1-75)-chitin-binding domain:

MGSSHHHHHHENLYFQGSGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQD
KEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGCFAKGTNVL
MADGSIECIENIEVGNKVMGKDGRPREVIKLPRGRETMYSVVQKSQHRAH
KSDSSREVPELLKFTCNATHELVVRTPRSVRRLSRTIKGVEYFEVITFEMGQ
KKAPDG


Native chemical ligation to make Ub(1-75)-Cys-IκBα.
His-Ub(1-75)-MESNa (200 µM final concentration) and freshly dissolved
IκBα peptide (H-CKKERLLDDRHDpSGLDpSMKDEEDYKDDDDK-OH) (1,000 µM final
concentration) were combined in a 1.5-ml tube in 50 mM NaPO₄, 50 mM
NaCl pH 6.5. This was incubated with rocking at 30 rpm for 1 h at room
temperature before TCEP was added to 1 mM. After rocking for an
additional hour at room temperature, the reaction was quenched by adding
500 mM NaPO₄ pH 8.0 to 45 mM. The entire solution was then incubated
with Ni-NTA resin (300 µl for a 1 ml reaction) at 30 rpm for 1 h at
4 °C. In a gravity column, the resin was then washed with 6 x 300 µl
50 mM NaPO₄, 50 mM NaCl, 1 mM β-mercaptoethanol pH 8.0. Protein was
eluted with 50 mM NaPO₄, 50 mM NaCl, 1 mM β-mercaptoethanol, 300 mM
imidazole pH 8.0. Fractions were analysed by SDS-PAGE and NanoDrop.


Formation of disulfide linkage between UBE2D Cys85 and
Ub(1-75)-Cys-IκBα. The same approach was used to generate complexes for
UBE2D2 and UBE2D3, referred to collectively as UBE2D.
UBE2D(C21I/C107A/C111D) was purified by size-exclusion chromatography
(see ‘Cloning, protein expression and purification’) and then
immediately used without freezing. After size-exclusion chromatography,
the protein was concentrated (Amicon, EMD Millipore) to 600 µM. Protein
(2 x 100 µl) was separately desalted (2 x Zeba, 0.5 ml column, 7,000
molecular weight cut-off filter, Thermo Fisher) to 20 mM HEPES, 250
mM NaCl, 5 mM EDTA pH 7.0. Eluates were combined and immediately
added to 34 µl of 10 mM 5,5′-dithiobis-(2-nitrobenzoic
acid) (Sigma Aldrich, dissolved in 50 mM NaPO₄ pH 7.5) and mixed
by pipetting before incubating at room temperature for 30 min. The
solution was then desalted (2 x Zeba, 0.5 ml column, 7,000 molecular
weight cut-off filter, Thermo Fisher) to 20 mM HEPES, 250 mM NaCl,
5 mM EDTA pH 7.0 at the same time that Ub(1-75)-Cys-IκBα (500 µl at
100 µM) was desalted (1 x Zeba, 2 ml column, 7,000 molecular weight
cut-off filter, Thermo Fisher) to the same buffer. The UBE2D and
ubiquitin components were then immediately combined and incubated at
room temperature for 30 min, at which point the sample was loaded onto
a Superdex 75 Increase column (GE Healthcare) equilibrated with 20
mM HEPES, 250 mM NaCl, 5 mM EDTA pH 7.0.


Comparing the ability of stable proxy for the UBE2D~Ub-substrate
intermediate and subcomplexes to compete with ubiquitylation.
Assays comparing ubiquitylation in the presence of competitors (stable
proxy for the UBE2D~Ub-substrate intermediate, stable
isopeptide-linked mimic of UBE2D-Ub, and nonmodifiable substrate
peptide) were carried out similarly to our substrate priming assay
described in ‘Substrate priming assays’ with the following
modifications. The assay was performed in pulse-chase format to exclude
the potential for competitors to affect generation of the UBE2D~Ub
intermediate. In the pulse reaction, a thioester-linked UBE2D~Ub
intermediate was generated by incubating 10 µM UBE2D, 15 µM fluorescent
Ub, and 0.2 µM UBA1 in a buffer that contained 50 mM Tris, 50 mM
NaCl, 2.5 mM MgCl₂ and 1.5 mM ATP pH 7.6 at room temperature for 10
min. The pulse reaction was next quenched by the addition of an equal
volume of 50 mM Tris, 50 mM NaCl, 50 mM EDTA pH 7.6 and placed on
ice for 5 min, then further diluted to 100 nM in a buffer containing 25
mM MES pH 6.5 and 150 mM NaCl. The E3-substrate mix consisted of
400 nM NEDD8-CUL1-RBX1, 400 nM SKP1-β-TRCP, 1 µM IκBα peptide,
with or without 1 µM competitor in a buffer consisting of 25 mM
MES and 150 mM NaCl pH 6.5, and was incubated at 4 °C for 10 min to
achieve equilibrium. Reactions were initiated on ice by the addition of
an equal volume of pulse reaction to the E3-substrate mix, resulting in
final reaction conditions of 50 nM UBE2D~Ub, 200 nM E3 neddylated
CRL1^β-TRCP and 500 nM substrate with or without 500 nM competitor.
Samples were taken at the indicated time points and quenched with
2x SDS-PAGE sample buffer. Substrate and products were then
separated by SDS-PAGE, and subsequently visualized using an Amersham
Typhoon imager (GE Healthcare).


Early attempt to visualize ubiquitin transfer by neddylated
CRL1^β-TRCP2 and UBE2D and rationale for approaches to improve
electron microscopy samples. In our initial attempt to determine a
structure visualizing ubiquitin transfer by neddylated CRL1^β-TRCP and
UBE2D, we used a full-length β-TRCP (β-TRCP2), which is a homodimer,
and a proxy for a UBE2D~Ub-substrate intermediate based on a
method used to capture a sumoylation intermediate. In the previous
study, SUMO was installed via an isopeptide bond on a residue adjacent
to the E2 catalytic Cys, and substrate was crosslinked to the E2 Cys via
an ethanedithiol linker. Here, we introduced the corresponding lysine
substitution in the background of an optimized E2
(UBE2D2(L119K/C21I/C107A/C111D)). Using high concentrations of UBA1 and
high pH, we generated an isopeptide-bonded complex between ubiquitin and
this UBE2D2 variant in a manner dependent on the L119K mutation,
and crosslinked the Cys of the IκBα substrate mimic peptide to the
UBE2D-Ub complex using EDT as previously described. The resultant
cryo-EM map, shown in Extended Data Fig. 3e, presented two major
challenges. First, the dimer exacerbated structural heterogeneity,
with minor differences presumably based on natural motions between
the two protomers. Second, the donor ubiquitin was poorly visible,
presumably owing to it not being linked to the catalytic Cys. Thus, we
generated many samples in parallel to overcome these challenges by
(1) using a monomeric version of β-TRCP1 that had previously been
crystallized; (2) removing two loops in SKP1 that are known to be
flexible and not required for ubiquitylation activity (although they are
required for CAND1-mediated substrate-receptor exchange); (3)
devising a chemical approach to synthesize a proxy for the
UBE2D~Ub-substrate intermediate in which all three entities are
simultaneously linked to the E2 catalytic Cys; and (4) using a point
mutant version of RBX1 that is hyperactive for substrate priming with
UBE2D but defective for ubiquitin chain elongation with UBE2R-family
E2s. Notably, with a K_m for UBE2D3 of 350 nM and rates of
ubiquitylating the medium β-catenin peptide substrate of 8.9 s⁻¹
(S0–S1) and 0.25 s⁻¹ (S1–S2), we confirmed that the monomeric version
of neddylated CRL1^β-TRCP is kinetically indistinguishable from
full-length, homodimeric neddylated CRL1^β-TRCP2.


Cryo-EM 

Sample preparation. For neddylated or unneddylated
CUL1-RBX1-SKP1-β-TRCP-IκBα samples, subcomplexes were mixed in an
equimolar ratio with 1.5-fold excess substrate peptide, incubated for
30 min on ice and purified by size-exclusion chromatography in 25 mM
HEPES, 150 mM NaCl, 1 mM DTT pH 7.5. The complex was further
concentrated and crosslinked by GraFix. Glycerol was then removed from
the sample using Zeba Desalt Spin Columns (Thermo Fisher); the sample
was concentrated to 0.3 mg ml⁻¹, and 3 µl was applied to R1.2/1.3 holey
carbon grids (Quantifoil) and plunge-frozen by Vitrobot Mark IV in
liquid ethane. Structure determination of the neddylated
CUL1-RBX1-SKP1-β-TRCP-Ub~UBE2D-IκBα complex used a similar method as
above, with 1.5-fold excess of the stable proxy for the UBE2D~Ub-IκBα
intermediate, but with no DTT in buffer. After SEC, GraFix, and
desalting, 3 µl of 0.08 mg ml⁻¹ sample was applied to graphene
oxide-coated Quantifoil R2/1 holey carbon grids (Quantifoil) and was
plunge-frozen by Vitrobot Mark IV in liquid ethane.


Electron microscopy. Datasets were collected on a Glacios cryo
transmission electron microscope at 200 kV using a K2 Summit direct
detector in counting mode. For the CUL1-RBX1-SKP1-β-TRCP1ΔD-IκBα
dataset, 6,433 images were recorded at 1.181 Å per pixel with a nominal
magnification of 36,000x. A total dose of 60 e⁻ Å⁻² was fractionated
over 50 frames, with a defocus range of -1.2 µm to -3.3 µm. For
NEDD8-CUL1-RBX1-SKP1-β-TRCP1ΔD-IκBα, 2,061 images were recorded at
1.885 Å per pixel with a nominal magnification of 22,000x. A total dose
of 59 e⁻ Å⁻² was fractionated over 38 frames, with a defocus range of
-1.2 µm to -3.3 µm.

Datasets were also collected on a Talos Arctica at 200 kV using a
Falcon II direct detector in linear mode. For each sample, around 800
images were recorded at 1.997 Å per pixel with a nominal magnification
of 73,000x. A total dose of approximately 60 e⁻ Å⁻² was fractionated
over 40 frames, with a defocus range of -1.5 µm to -3.5 µm.

High-resolution cryo-EM data were collected on a Titan Krios
electron microscope at 300 kV with a Quantum-LS energy filter, using a
K2 Summit direct detector in counting mode. 9,112 images were recorded
at 1.06 Å per pixel with a nominal magnification of 130,000x. A total
dose of 70.2 e⁻ Å⁻² was fractionated over 60 frames, with a defocus
range of -1.2 µm to -3.6 µm.
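
For readers comparing the collection schemes, the per-frame exposure implied by each total dose and frame count can be derived as in the sketch below. The helper names and dataset labels are hypothetical; the numbers are those stated above, with the Talos Arctica dose taken at its approximate value of 60 e⁻ Å⁻².

```python
def dose_per_frame(total_dose, n_frames):
    """Average dose per movie frame, in electrons per square angstrom."""
    return total_dose / n_frames

def electrons_per_pixel(total_dose, pixel_size):
    """Total electrons landing on one pixel, given the pixel size in angstroms."""
    return total_dose * pixel_size ** 2

# (total dose in e-/A^2, number of frames, pixel size in A) as stated in the text.
datasets = {
    "Glacios, unneddylated": (60.0, 50, 1.181),
    "Glacios, neddylated": (59.0, 38, 1.885),
    "Talos Arctica": (60.0, 40, 1.997),
    "Titan Krios": (70.2, 60, 1.06),
}

per_frame = {name: dose_per_frame(dose, frames)
             for name, (dose, frames, _) in datasets.items()}
per_pixel = {name: electrons_per_pixel(dose, pixel)
             for name, (dose, _, pixel) in datasets.items()}
```

These derived values are a convenience for comparison only; they are not reported in the original Methods.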


Data processing. Frames were motion-corrected using RELION-3.0
with dose weighting. Contrast transfer function was estimated using
CTFFIND. Particles were picked with Gautomatch (K. Zhang, MRC
Laboratory of Molecular Biology). Two-dimensional classification
was performed in RELION-3.0, followed by 3D ab initio model building
by sxviper.py from SPARX. The initial model from sxviper.py was
imported to RELION-3.0 for further 3D classification, refinement,
post-processing and particle polishing using frames 2-25.


Protein identification and model building. The final reconstructions
displayed clear main chain and side chain densities, which enabled us to
model and refine the atomic coordinates. Known components (CUL1-RBX1,
RCSB Protein Data Bank codes (PDB) 1LDJ and 4P50; SKP1ΔΔ-β-TRCP1,
PDB 1P22 and 6M90; UBE2D~Ub with a backside-bound ubiquitin
to be replaced by NEDD8 sequences, PDB 4V3L) were manually placed
as a whole or in parts and fit with rigid-body refinement using UCSF
Chimera. The resultant complete structure underwent rigid-body
refinements in which each protein or domain was allowed to move
independently. Further iterative manual model building and real space
refinements were carried out until good geometry and map-to-model
correlation were reached. Manual model building and rebuilding were
performed using COOT, and Phenix.refine was used for real space
refinement.


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


The atomic coordinates and electron microscopy maps have been
deposited in the PDB with accession code 6TTU and the Electron
Microscopy Data Bank with codes EMD-10585, EMD-10578, EMD-10579,
EMD-10580, EMD-10581, EMD-10582 and EMD-10583. Uncropped gel
source data are included as Supplementary Information. All other
reagents and data (for example, raw gels of replicate experiments and
raw movie electron microscopy data) are available from the corresponding
author upon request.
Extended Data Fig. 1 | Quantitative pre-steady-state enzyme kinetics of
neddylated CRL1^β-TRCP- and UBE2D-dependent ubiquitylation. Gel images
are representative of independent technical replicates (n = 2); the
symbols on the graphs show the data from independent experiments
(n = 2). a, Autoradiogram of SDS-PAGE gel showing products of
ubiquitylation reactions under single-encounter conditions for the
interaction of radiolabelled substrate (medium β-catenin substrate
peptide derived from β-catenin) with neddylated CRL1^β-TRCP titrating
UBE2D3 (hereafter denoted UBE2D). Each lane represents a single
ubiquitylation reaction that was used to estimate the fraction of
peptide that had been converted into ubiquitylated products as a
function of UBE2D concentration. b, Plots of the fraction of substrate
that had been converted to ubiquitylated products against UBE2D
concentration for ubiquitylation reactions containing either wild-type
UBE2D (as shown in a), or the mutants UBE2D(S22R) or UBE2D(H32A).
Various CRL1^β-TRCP complexes were assayed that contained either
wild-type neddylated CRL1^β-TRCP (red), CRL1^β-TRCP complexes modified
by NEDD8 variants containing I44A (orange), Q40E (green) or
‘ubiquitylizing’ (L2Q/K4F/E14T/D16E/G63K/G64E) substitutions (blue), or
CRL1^β-TRCP modified by Ub(R72A) that is competent for ligation to CUL1
(purple) or unmodified CRL1^β-TRCP (CUL1 with the neddylation-site
mutation K720R, black). Duplicate data points from independent
experiments performed with identical samples are shown and were fit to
the Michaelis-Menten model to estimate the K_m of UBE2D for CRL1^β-TRCP
using nonlinear curve fitting (GraphPad Prism). c, Plots of the fraction
of substrate that had been converted to ubiquitylated products against
UBE2D concentration for ubiquitylation reactions with various substrate
peptides: derived from IκBα (but with a single acceptor Lys); derived
from β-catenin; derived from β-catenin with different spacing between
the phosphodegron motif and a potential acceptor Lys (a medium
β-catenin substrate peptide with a nine-residue spacer between the
β-catenin phosphodegron and acceptor, matching the relative position of
these moieties in IκBα; and a short β-catenin substrate in which the
four residues between these moieties are too few to bridge the
structurally observed gap between the substrate receptor and UBE2D~Ub
active site); and a homogeneous ubiquitin-linked β-catenin generated by
sortase-mediated transpeptidation wherein the only lysines are from
ubiquitin. d, Autoradiogram of SDS-PAGE gel showing results from rapid
quench-flow reactions under pre-steady-state single-encounter conditions
for the interaction of radiolabelled substrate (a medium β-catenin
phosphopeptide) with CRL1^β-TRCP. The representative raw data are from a
reaction using wild-type UBE2D and wild-type neddylated CRL1^β-TRCP, and
show time-resolved conjugation of increasing numbers of individual
ubiquitin molecules. S0, substrate with 0 ubiquitins; S1, substrate with
one ubiquitin; S2, substrate with two ubiquitins; and so on. e, Plots
comparing various substrate peptides described in c, showing
disappearance of unmodified substrate (S0) with black circles, and the
appearance of mono-ubiquitylated substrate (S1) with grey triangles, in
rapid quench-flow reactions all performed as in d and under
single-encounter conditions as in a. Duplicate data points from
independent experiments performed with identical samples are shown. The
data were fit to closed-form equations (Mathematica) as previously
described to obtain both the rates for the transfer of the first
ubiquitin to substrate (k_obs(S0–S1)) and of the second ubiquitin to the
singly ubiquitin-modified substrate (k_obs(S1–S2)) as well as their
associated standard error (Extended Data Table 1). f, Plots from
experiments performed and analysed as described in e, except with
radiolabelled medium β-catenin peptide substrate, CRL1^β-TRCP variants
containing the indicated versions of CUL1-RBX1, and with either
wild-type or indicated mutant versions of UBE2D.
corresponding to the percentage of particles in that 3D class. b, As in a but with
neddylated CRL1^β-TRCP, with its surfaces outlined in different classes
encompassing the RING domain of RBX1, the WHB domain of CUL1, and
covalently modified NEDD8. c, Refined cryo-EM density from CRL1^β-TRCP reveals
the substrate-scaffolding module bridging the substrate recruited to substrate
receptor β-TRCP with the intermolecular C/R domain, readily fitted with
neddylated CRL1^β-TRCP-catalysed ubiquitin transfer from E2 UBE2D to an
IκBα-derived substrate peptide. Right, semi-transparent version of the schematic,
Extended Data Fig. 3 | Generation of a stable proxy for the
UBE2D~Ub-substrate intermediate, and characterization in complexes with
neddylated CRL1^β-TRCP by cryo-EM and biochemistry. Gel panels in this
figure are representative of two independent experiments; n = 2. a, Our
strategy for trapping a mimic of the transient neddylated CRL
E2~Ub-substrate complex requires that the E2 UBE2D contain only a single
cysteine at the active site. However, UBE2D contains three additional
cysteines (Cys21, Cys107 and Cys111). Standard replacements of cysteine
by serine or alanine severely compromised activity. On the basis of the
structural locations of these cysteines, we presumed that their mutation
hindered formation of the RING-activated, closed, active UBE2D~Ub
conformation. We thus devised a systematic structure- and random-based
approach to identify suitable replacements that qualitatively maintain
wild-type levels of activity with neddylated CRLs. Structural analysis
showed that Cys21 and Cys107 are in close proximity, such that mutation
of both residues to alanine may generate a destabilizing cavity at this
site. Combining UBE2D2(C107A) with Cys21 mutated to isoleucine, leucine
or valine to compensate for the reduced hydrophobic volume led to the
identification of C21I/C107A as a suitable version for testing all other
possible replacements for Cys111. A similar approach was taken for
UBE2D3. A total of 48 different versions of UBE2D were tested to
identify the UBE2D(C21I/C107A/C111D) mutant for chemical trapping at the
remaining active site cysteine. b, Top, schematic of pulse-chase assay
testing intrinsic activation of thioester-linked UBE2D~Ub intermediates.
Although this is often tested by monitoring RING-dependent discharge of
ubiquitin from UBE2D to free lysine, RBX1 RING-dependent activity is
limited in this assay owing to sequence constraints imposed by the
requirements for binding to partners other than UBE2D. Nonetheless,
substrate-independent activation of UBE2D~Ub can be readily visualized
using CUL1 complexed with a previously described hyperactive mutant
RBX1(N98R), and high enzyme and lysine concentrations. UBE2D~Ub
generated in a pulse reaction was mixed with NEDD8-modified CUL1-RBX1
(shown here with the N98R mutant) and free lysine, and ubiquitin
discharge was monitored over time by Coomassie-stained SDS-PAGE (as
shown by the representative gel at the bottom), demonstrating that
standard serine or alanine mutations of noncatalytic cysteines
compromised activity (shown for the mutant C21A/C107A/C111S), whereas
the optimized mutant (C21I/C107A/C111D) retains activity similar to that
of the wild type. c, Overview of the generation of our stable proxy for
the phosphorylated IκBα substrate intermediate linked at a single atom,
and comparison to the previous method used to visualize noncanonical Lys
sumoylation. d, Experiment validating our stable proxy for the
UBE2D~Ub-phosphorylated IκBα substrate intermediate linked at a single
atom, based on the hypothesis that its simultaneous occupation of the
binding sites for the UBE2D~Ub intermediate and substrate should result
in more potent inhibition of a neddylated CRL1^β-TRCP-dependent
substrate priming reaction compared to the individual constituents of
the complex. e, Cryo-EM reconstruction of neddylated CRL1^β-TRCP2 (with
full-length, dimeric β-TRCP2) bound to a mimic of UBE2D2-Ub-IκBα
generated by adapting the method used previously to visualize
noncanonical lysine sumoylation. Ubiquitin is isopeptide-bonded to the
substituted residue of a UBE2D(L119K) mutant, and a cysteine residue
that replaces the acceptor in the substrate is disulfide-bonded to the
catalytic cysteine of UBE2D2. This electron microscopy map visualizes
the catalytic architecture of dimeric CRL1^β-TRCP2 in which the
dimerization domain agrees well with the previous crystal structure, and
its linked NEDD8 (circled in yellow) is bound to the backside of UBE2D,
but the donor ubiquitin (absent from the region circled in orange) was
not visible, presumably owing to inadequacies of the method used to
generate this mimic of the catalytic intermediate, in which the
ubiquitin and substrate are not both simultaneously linked to the UBE2D
catalytic cysteine. Variations between the two protomers of the dimer
also exacerbated sample heterogeneity. f, Cryo-EM reconstruction of
neddylated CRL1^β-TRCP1ΔD (with monomeric version of β-TRCP1, from
residue 175 to the C terminus) bound to our newly developed proxy for
the UBE2D3~Ub-IκBα intermediate. The phospho-IκBα
peptide-substrate-bound β-TRCP-SKP1-CUL1-RBX1-NEDD8-UBE2D portion of
this map superimposes with the map for the dimeric complex shown in e,
but here the entire complex is visible, including both the NEDD8
(circled in yellow) and donor ubiquitin (circled in orange). g, To
further increase cryo-EM sample homogeneity, we considered that the RBX1
RING sequence represents a compromise to meet requirements for its many
different catalytic activities achieved with neddylation E2s, various
ubiquitin-carrying enzymes, and regulators including the inhibitor GLMN.
Therefore, we introduced a second RBX1 linchpin residue via mutation
(N98R), which has previously been shown to improve neddylated CRL- and
UBE2D-dependent substrate priming at the expense of other RBX1-dependent
functions (for example, with UBE2M and UBE2R2). A Coomassie-stained
SDS-PAGE gel from an assay for the intrinsic activity of UBE2D~Ub is
shown, showing enhanced neddylated CRL-dependent activation of discharge
to free lysine with the RBX1 N98R mutation. h, i, Cryo-EM
reconstructions of neddylated CRL1^β-TRCP1ΔD with RBX1(N98R) bound to
our newly developed proxies for the UBE2D3~Ub-IκBα and UBE2D2~Ub-IκBα
intermediates, the latter of which was pursued for high-resolution
electron microscopy (final reconstruction refined to 3.7 Å resolution,
shown on right).

Extended Data Fig. 4 | Flowchart showing the stages of cryo-EM image
processing. a, Cryo-EM image-processing flow chart. Ultimately,
reconstruction of the data yielded a focused refinement at 3.46 Å resolution
and a global refinement at 3.7 Å resolution that superimposes well with
lower-resolution maps that were obtained during attempts to visualize substrate
priming with neddylated wild-type dimeric CRL1^β-TRCP. b, Two-dimensional
classes representing particles used for final reconstructions. c, Angular
distribution of final reconstruction. d, Gold-standard Fourier shell correlation
(FSC) curve showing overall resolution at 3.72 Å at an FSC of 0.143. e, Electron
microscopy density map coloured by local resolution. NEDD8, circled in yellow,
is the entity displaying the highest local resolution in the map.
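
The resolution quoted in panel d is read off the FSC curve at the point where it falls below 0.143. A minimal sketch of that read-out follows, assuming the curve is available as two equal-length lists (spatial frequency in 1/Å and FSC value); the function name and the linear interpolation between sampled shells are illustrative assumptions, not the processing software used for this dataset.

```python
from typing import Optional, Sequence


def resolution_at_threshold(
    freqs: Sequence[float],
    fsc: Sequence[float],
    threshold: float = 0.143,
) -> Optional[float]:
    """Return the resolution (in Å) at which the FSC first drops below `threshold`.

    `freqs` are spatial frequencies in 1/Å (ascending, positive) and `fsc`
    the correlation in each shell. The crossing point is found by linear
    interpolation between the two shells that bracket the threshold.
    Returns None if the curve never falls below the threshold.
    """
    for i in range(1, len(fsc)):
        above, below = fsc[i - 1], fsc[i]
        if above >= threshold > below:
            # Fraction of the way from shell i-1 to shell i at the crossing.
            t = (above - threshold) / (above - below)
            crossing = freqs[i - 1] + t * (freqs[i] - freqs[i - 1])
            return 1.0 / crossing
    return None


if __name__ == "__main__":
    # Toy curve, not data from this study.
    toy_freqs = [0.1, 0.2, 0.3, 0.4]
    toy_fsc = [1.0, 0.9, 0.2, 0.0]
    print(resolution_at_threshold(toy_freqs, toy_fsc))
```

Interpolating rather than taking the nearest shell avoids quantizing the reported value to the shell spacing, which matters when the shells are coarse.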

Extended Data Fig. 5 | Extraordinary cullin-RING conformational changes in
catalytic architecture juxtaposing the substrate and the active site of
ubiquitylation. a, Side-by-side comparison of relative RING-domain locations
in different CRL complexes after superposition of the C/R domains from the
original CUL1-RBX1 structure (PDB: 1LDJ, ‘pre-neddylation’, which data herein
show is dynamic, although the crystal structure probably captured the
conformation that enabled CAND1 binding and substrate receptor exchange),
the structure representing the neddylation reaction (PDB: 4P50), a
structure of a neddylated CUL5-RBX1 domain (PDB: 3DQV, labelled
‘post-neddylation’, which revealed the potential for conformational changes in
the neddylated CUL WHB and RBX1 RING domains, and which data herein show is
dynamic), and the structure presented here showing how the neddylated CUL1
WHB domain and RBX1 RING domain are harnessed in a catalytic architecture
for ‘active ubiquitylation’. Trp35 of RBX1 is highlighted to show how it serves
as a multifunctional platform for either the RING domain in different
orientations, or for the E2-linked NEDD8 during neddylation.
b, Superposition of the structures shown in a, highlighting different relative
positions of the RING domain. c, Comparison of the relative locations of the
CUL WHB domain in different structures after superimposing their C/R
domains (not shown). d, Cryo-EM density from the neddylated CRL1^β-TRCP–
UBE2D-Ub–substrate intermediate complex, showing patchiness of the region
corresponding to CUL1 ‘helix-29’. This CUL1 region connecting the C/R and
WHB domains is visible only as patchy density, whereas in previous cullin
crystals this forms the rod-like helix-29 continuing into the WHB domain. It
seems that helix-29 of CUL1 dissolves into a flexible tether, which rationalizes
the previously observed proteolytic sensitivity of this region in a neddylated
CUL1-RBX1 complex, and enables the displacement and rotation required for
placing the ensuing WHB domain and its linked NEDD8 at the centre of the
ubiquitylation complex.


[Extended Data Fig. 6, panels a-d. Panel b: sequences of the peptide substrates used in this study (IκBα; wild-type, medium and short β-catenin) aligned with reported β-TRCP substrates; the aligned sequences are not recoverable. Proteins listed: NF-kappa-B inhibitor alpha (substrate IκBα in this study), Catenin beta-1 (substrate β-catenin in this study), Cellular tumor antigen p53, NF-kappa-B inhibitor beta, NF-kappa-B inhibitor epsilon, Prolactin receptor, cAMP-dependent TF ATF-4, E3 ubiquitin-protein ligase UHRF1, M-phase inducer phosphatase 1, Fanconi anemia group M protein, Programmed cell death protein 4, Nuclear factor erythroid 2-related factor 2, Growth hormone receptor, DEP domain-containing mTOR-interacting protein, Interferon alpha/beta receptor 1, Transcription factor Sp1, Erythropoietin receptor, Nuclear factor NF-kappa-B p100 subunit, Disks large homolog 1, Serine/threonine-protein kinase PLK4, Fizzy-related protein homolog, Myc proto-oncogene protein, Heterogeneous nuclear ribonucleoprotein D0, ELAV-like protein 1, Protein Vpu (isolate BRU/LAI), Claspin, Period circadian protein homolog 1, Bcl-2-like protein 11, Period circadian protein homolog 2, Ubiquitin carboxyl-terminal hydrolase 37, M-phase inducer phosphatase 2, Twist-related protein 1. Panel c: gel of the short β-catenin substrate peptide with neddylated CRL1^β-TRCP and CRL1^β-TRCP, bands S0 (short β-cat) to S5, time in sec. Panel d: plots of S0 and S5+ for neddylated CRL1^β-TRCP and CRL1^β-TRCP against time (min).]


Extended Data Fig. 6 | Geometry between phosphodegron and acceptor in
structure, substrates and ubiquitylation. a, Cryo-EM density highlighting the
relative placement of the substrate degron and the UBE2D-Ub active site. The
approximately 22 Å distance between the UBE2D-Ub active site and the
phosphodegron of β-TRCP-bound substrate requires at least 6 intervening
residues in a substrate. b, Alignments for several reported β-TRCP substrates,
highlighting the degron sequence (yellow) and nearby lysines (red). Also shown
are sequences of peptide substrates with a single acceptor Lys that were used in
kinetics analyses. The peptide sequences were derived from IκBα, and from
β-catenin with varying spacers between phosphodegron and acceptor Lys:
wild-type β-catenin peptide, medium β-catenin peptide with lysine
corresponding to IκBα, and short β-catenin peptide with a lysine five residues
upstream of the N-terminal phosphoserine in the degron, which would be too
short to bridge the structurally observed distance between the
phosphodegron binding site on β-TRCP and the UBE2D catalytic cysteine in the
ubiquitylation active site. c, Representative autoradiogram (n = 2) of SDS-PAGE
gel showing products from indicated time points of ubiquitylation reactions
under multiturnover conditions with either neddylated or unneddylated
CRL1^β-TRCP and radiolabelled short β-catenin peptide substrate. The amount of
short β-catenin peptide modified by neddylated CRL1^β-TRCP and UBE2D is too
low in the single-encounter ubiquitylation reaction to enable quantification of
kinetic parameters; however, product formation is apparent under
multiturnover conditions and shows that most products are heavily ubiquitylated.
d, Plots fitting consumption of unmodified short β-catenin peptide substrate
(S0) compared to formation of polyubiquitin chains with five or more
ubiquitins (S5+) from reactions as in c. The symbols show the data from
independent experiments (n = 2 technical replicates).

Extended Data Fig. 7 | Interactions shaping the catalytic architecture of
neddylated CRL1^β-TRCP–UBE2D-Ub–IκBα substrate intermediate. a, NEDD8
and the catalytic module from the structure representing the neddylated
CRL1^β-TRCP–UBE2D-Ub–IκBα intermediate, highlighting distinctive interactions
of NEDD8 (yellow) and donor ubiquitin (orange) with UBE2D. b, Catalytic
module from the neddylated CRL1^β-TRCP–UBE2D-Ub–IκBα intermediate,
highlighting the covalently linked proxy for the IκBα substrate’s acceptor in the
active site relative to a superimposed representative previous crystal structure
of an isolated RING–UBE2D-Ub complex (grey, PDB: 4AP4). In the inset, the
density for the covalently linked proxy for the IκBα substrate’s acceptor is shown
in red in the active site. The chemical trap superimposes with consensus
acceptors visualized in active sites of sumoylation and neddylation
intermediates, where aromatic side chains guide the lysine targets (blue and
green, respectively). However, the myriad substrates of UBE2D neither
conform to a specific motif nor do they or UBE2D display specific side chains
that guide lysine acceptors into the catalytic centre. Instead, in the neddylated
CRL1^β-TRCP–UBE2D-Ub–substrate complex, density from backbone atoms
preceding the chemical proxy for the acceptor lysine corresponds to the
aromatic guides in sumoylation and neddylation intermediates. c, Overview of
assays for activation of intrinsic reactivity of the UBE2D-Ub intermediate. Top,
schematic of pulse-chase assay for testing the effects of UBE2D mutations on
activation, monitoring UBE2D-Ub discharge to free lysine activated by
neddylated CUL-RBX1 compared to unneddylated or RING-like UBE4B
controls. Bottom, sites of mutations shown as spheres on the structure of
UBE2D from the cryo-EM structure of the neddylated CRL1^β-TRCP–UBE2D-Ub–
substrate complex. The colours of spheres reflect both the locations and the
effects on UBE2D-Ub discharge to free lysine. Sites of mutations with marginal
or no effect are shown in cyan, whereas those with major effects are otherwise
coloured. Mutations that cause major defects map to the RBX1 RING-binding
site (blue), the interaction surface with the donor ubiquitin (orange), and the
interaction surface with NEDD8 (yellow). d, Representative Coomassie-stained
SDS-PAGE gels (of two independent experiments) shown for reactions
monitoring substrate-independent discharge of UBE2D-Ub to free lysine, in
the presence of CUL1-RBX1(N98R) that was either neddylated or unneddylated
(K720R), with either wild-type or the indicated UBE2D3 mutants at binding
sites for backside-bound NEDD8 (S22R), the RBX1 RING (F62A), and the
covalently linked donor ubiquitin in the closed conformation (S108L). e, As in d,
except testing the effect of NEDD8(Q40E), which would disrupt the activation
module. f, Reactions performed as in d, except with indicated variants of
UBE2D2, in reactions with CUL1-RBX1(N98R) that was either neddylated or
unneddylated (K720R), or with the optimized RING-like U-box domain from
UBE4B as a reference. For mutations reporting on the catalytic conformation
(G24K, T36K, M38K, A96K and D112K), representative gels are shown for two
experiments. All other experiments were performed once. g, Comparison of
β1/β2-loop conformations after superimposing the indicated structures of
NEDD8 and ubiquitin. The comparison suggests that whereas NEDD8 and
ubiquitin can adopt both loop-in and loop-out conformations, donors linked to
E2 active sites in RING-activated complexes adopt the loop-in conformation,
and those bound to the UBE2D backside adopt loop-out conformations. h, Only
a loop-out conformation is compatible with the neddylated CRL activation
module structure, because the loop-in conformation from the structures shown
in g would prevent noncovalent interactions with the WHB domain of CUL1
(green). i, Only a loop-out conformation is compatible for the CUL1-linked
NEDD8 to bind the catalytic module, because loop-in conformations in the
structures shown in g would prevent noncovalent interactions with the
backside of UBE2D (cyan).

Extended Data Fig. 8 | Qualitative validation of the mechanistic principles
that underlie substrate priming by neddylated CRLs and UBE2D. Gel panels
in this figure are representative of two independent experiments; n = 2
technical replicates. a, Schematic of a qualitative substrate priming assay for
testing effects of mutations in neddylated CRL1^β-TRCP or UBE2D on substrate
priming, monitoring fluorescent ubiquitin transfer from UBE2D3 to the
phosphorylated IκBα substrate. b, Scan of SDS-PAGE detecting fluorescent
ubiquitin transferred to the IκBα-derived substrate in a qualitative assay for
NEDD8 activation of substrate priming. c, As in b, showing the effect on
substrate priming of disrupting the activation module with the Q40E mutation
of NEDD8. d, As in b, showing the effect on substrate priming of disrupting
interactions between the activation and catalytic modules with the
NEDD8(I44A) or UBE2D(S22R) mutants. e, As in b, showing the effect on
substrate priming of disrupting interactions between the activation and
substrate-scaffolding modules, through CUL1 modification by a ‘ubiquitylized’
NEDD8 mutant with six residues swapped for their ubiquitin counterparts
(L2Q/K4F/E14T/D16E/G63K/G64E). f, As in b, showing the effect of the H32A
mutation in UBE2D at the interface between the catalytic and
substrate-scaffolding modules. g, Scheme of pulse-chase assay for testing the
effects of mutations in neddylated CRL1^FBW7 or UBE2D on substrate priming.
The assay monitors the transfer of fluorescent ubiquitin from UBE2D to a
peptide substrate derived from phosphorylated cyclin E (pCyE). h, Fluorescent
scan detecting ubiquitin transferred to the pCyE substrate by neddylated
CRL1^FBW7 and the indicated mutants of UBE2D. i, Fluorescent scan detecting
ubiquitin transferred to the pCyE substrate by UBE2D and indicated variants of
neddylated (or ubiquitylated) CRL1^FBW7. The experiment with unneddylated
CRL1^FBW7 used the K720R variant of CUL1 to prevent artefactual
ubiquitylation. j, Scheme of pulse-chase assay for testing effects of mutations
in neddylated CRL4^CRBN or UBE2D on substrate priming, monitoring
fluorescent ubiquitin transfer from UBE2D to the IKZF1/3 ZF2 substrate in the
presence of the immunomodulatory drug pomalidomide. k, Fluorescent scan of
assay validation, showing dependence on pomalidomide. l, Fluorescent scan
detecting ubiquitin transferred to the IKZF substrate by CRL4^CRBN,
pomalidomide and the indicated variants of UBE2D. m-o, Fluorescent scan
detecting ubiquitin transferred to the IKZF substrate by UBE2D and the
indicated variants of neddylated (or ubiquitylated) CRL4^CRBN with
pomalidomide. Experiments with unneddylated CRL4^CRBN used the K705R
variant of CUL4A to prevent artefactual ubiquitylation.

Extended Data Table 1 | Estimates of K_m and k_obs for substrate ubiquitylation

Substrate      E2          CRL1^β-TRCP                 K_m           k_obs S0-S1      k_obs S1-S2      k_obs S0-S1/K_m   Fold
                                                       (10^-9 M)     (sec^-1)         (sec^-1)         (M^-1 sec^-1)     change
IκBα           WT UBE2D    NEDD8-CUL1-RBX1             408 ± 57      12.9 ± 1.7       0.24 ± 0.04      3.2 × 10^7        -
IκBα           WT UBE2D    CUL1-RBX1                   4014 ± 956    0.05 ± 0.002     -                1.2 × 10^4        2667
WT β-cat       WT UBE2D    NEDD8-CUL1-RBX1             214 ± 30      4.8 ± 0.46       0.19 ± 0.02      2.2 × 10^7        -
WT β-cat       WT UBE2D    CUL1-RBX1                   4638 ± 138    0.05 ± 0.005     -                1.1 × 10^4        2000
medium β-cat   WT UBE2D    NEDD8-CUL1-RBX1             372 ± 48      11 ± 1.00        0.24 ± 0.04      3.0 × 10^7        -
medium β-cat   WT UBE2D    CUL1-RBX1                   4875 ± 1328   0.08 ± 0.005     -                1.6 × 10^4        1875
short β-cat    WT UBE2D    CUL1-RBX1                   5207 ± 1051   0.02 ± 0.0004    -                3.8 × 10^3

medium β-cat   WT UBE2D    NEDD8 I44A-CUL1-RBX1        1087 ± 69     3.60 ± 0.27      0.17 ± 0.03      3.3 × 10^6        9.1
medium β-cat   WT UBE2D    UB-CUL1-RBX1                2056 ± 170    1.0 ± 0.03       0.15 ± 0.01      4.9 × 10^5        61.2
medium β-cat   WT UBE2D    NEDD8 Q40E-CUL1-RBX1        1941 ± 226    1.77 ± 0.13      0.07 ± 0.01      9.1 × 10^5        33.0
medium β-cat   WT UBE2D    UBylized NEDD8-CUL1-RBX1    1651 ± 137    2.20 ± 0.15      0.17 ± 0.04      1.3 × 10^6        23.1
medium β-cat   S22R UBE2D  NEDD8-CUL1-RBX1             913 ± 135     1.89 ± 0.14      0.17 ± 0.03      2.1 × 10^6        14.2
medium β-cat   S22R UBE2D  NEDD8 I44A-CUL1-RBX1        5190 ± 604    0.13 ± 0.01      -                2.5 × 10^4        1200
medium β-cat   S22R UBE2D  UB-CUL1-RBX1                6492 ± 872    0.08 ± 0.005     -                1.2 × 10^4        2500
medium β-cat   S22R UBE2D  CUL1-RBX1                   4007 ± 754    0.013 ± 0.002    -                3.3 × 10^3        9091
medium β-cat   S22R UBE2D  NEDD8 Q40E-CUL1-RBX1        3581 ± 367    0.17 ± 0.005     -                4.7 × 10^4        638.3
medium β-cat   S22R UBE2D  UBylized NEDD8-CUL1-RBX1    6105 ± 1030   0.12 ± 0.007     -                2.0 × 10^4        1500
medium β-cat   H32A UBE2D  NEDD8-CUL1-RBX1             347 ± 53      8.15 ± 1.04      0.73 ± 0.09      2.3 × 10^7        1.3
medium β-cat   H32A UBE2D  NEDD8 I44A-CUL1-RBX1        3431 ± 332    0.58 ± 0.04      0.12 ± 0.02      1.7 × 10^5        176.5
medium β-cat   H32A UBE2D  UB-CUL1-RBX1                2999 ± 301    0.34 ± 0.01      0.10 ± 0.02      1.1 × 10^5        272.7
medium β-cat   H32A UBE2D  CUL1-RBX1                   3305 ± 1119   0.06 ± 0.003     -                1.8 × 10^4        1666.7
medium β-cat   H32A UBE2D  NEDD8 Q40E-CUL1-RBX1        2923 ± 459    0.81 ± 0.06      -                2.8 × 10^5        107.1
medium β-cat   H32A UBE2D  UBylized NEDD8-CUL1-RBX1    4303 ± 310    0.28 ± 0.02      -                6.5 × 10^4        461.5

UB-β-cat       WT UBE2D    NEDD8-CUL1-RBX1             148 ± 15      -                0.48 ± 0.02
UB-β-cat       WT UBE2D    CUL1-RBX1                   1503 ± 439    -                0.005 ± 0.0004

S0 refers to unmodified substrate, S1 to substrate modified by a single ubiquitin and S2 to substrate modified with two ubiquitins. S0-S1 refers to the transition from unmodified substrate to
Ub-substrate, and S1-S2 to the transition from Ub-substrate to Ub-Ub-substrate. Values for K_m are the best-fit values derived from nonlinear regression in Prism, and values for k_obs are the best-fit
values derived from nonlinear regression in Mathematica. The measure of error is the standard error of the mean as determined by Prism and Mathematica, respectively, from experiments and
curve fits such as those shown in Extended Data Fig. 1 (n = 2).
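
The last two columns of Extended Data Table 1 follow arithmetically from the first two kinetic columns. The sketch below reproduces that arithmetic for the IκBα rows; it assumes that K_m is tabulated in nM, that the ratio is rounded to two significant figures before the fold change is taken, and that fold change is the neddylated wild-type value divided by the variant value. These conventions reproduce the tabulated numbers but are inferred here rather than stated in the table, and the function names are illustrative.

```python
def round_sig(x: float, sig: int = 2) -> float:
    """Round x to `sig` significant figures."""
    return float(f"{x:.{sig - 1}e}")


def efficiency(k_obs_per_sec: float, km_nanomolar: float) -> float:
    """k_obs / K_m in M^-1 sec^-1, with K_m supplied in nM."""
    return k_obs_per_sec / (km_nanomolar * 1e-9)


def fold_change(reference: float, value: float) -> float:
    """How many times lower `value` is than `reference`."""
    return reference / value


if __name__ == "__main__":
    # IκBα with WT UBE2D: neddylated versus unneddylated CUL1-RBX1.
    neddylated = round_sig(efficiency(12.9, 408))
    unneddylated = round_sig(efficiency(0.05, 4014))
    print(neddylated, unneddylated, round(fold_change(neddylated, unneddylated)))
```

The same calculation can be applied row by row to check exponents that are hard to read in the printed table.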


Extended Data Table 2 | Cryo-EM data collection, refinement and validation statistics

                                  EMD-10578     EMD-10579     EMD-10580     EMD-10581     EMD-10585     EMD-10582     EMD-10583
                                                                                          (PDB 6TTU)
IκBα-UB~UBE2D crosslink
  (C21I C107A C111D)              UBE2D2        UBE2D3        UBE2D3        UBE2D2        UBE2D2        none          none
                                  2x2way XL     3way XL       3way XL       3way XL       3way XL
Substrate receptor                β-TRCP2       β-TRCP1       β-TRCP1       β-TRCP1       β-TRCP1       β-TRCP1       β-TRCP1
                                                175-C         175-C         175-C         175-C         175-C         175-C
RBX1                              WT            WT            N98R          N98R          N98R          WT            WT
NEDD8                             yes           yes           yes           yes           yes           no            yes
SKP1                              WT            ΔΔ            ΔΔ            ΔΔ            ΔΔ            WT            WT

Data collection and processing
Microscope                        Krios         Arctica       Arctica       Arctica       Krios         Glacios       Glacios
Magnification                     105,000       92,000        73,000        73,000        130,000       36,000        22,000
Voltage (kV)                      300           200           200           200           300           200           200
Electron exposure (e-/Å^2)        56            61.3          60.8          70            70.2          60            59
Defocus range (μm)                -1.2 to -3.6  -1.5 to -3.5  -1.5 to -3.5  -1.5 to -3.5  -1.2 to -3.3  -1.2 to -3.3  -1.2 to -3.3
Pixel size (Å)                    1.34          1.612         1.997         1.997         1.06          1.181         1.885
Symmetry imposed                  C2            C1            C1            C1            C1            C1            C1
Initial particle images (no.)     2,575,161     464,344       601,121       459,011       1,661,870     2,051,804     1,666,293
Final particle images (no.)       33,738        47,246        107,311       40,835        106,257       262,116       349,803
Map resolution (Å)                9.3           8.6           9.4           8.4           3.72          4.64          6.7
  FSC threshold                   (0.143)       (0.143)       (0.143)       (0.143)       (0.143)       (0.143)       (0.143)
Map resolution range (Å)                                                                  3.46 to 6.0
Map sharpening B factor (Å^2)     -578.9        -1159         -1272         -983.5        -94.2         -199

Refinement and validation (EMD-10585, PDB 6TTU only)
Initial model used (PDB code)     1LDJ, 6M90, 4P50, 4V3L
Model resolution (Å)              3.7
  FSC threshold                   (0.143)
Model composition
  Non-hydrogen atoms              13028
  Protein residues                1616
  Ligands                         3 (ZN)
B factors (Å^2)
  Protein                         88.6
  Ligand                          203.8
R.m.s. deviations
  Bond lengths (Å)                0.006
  Bond angles (°)                 0.850
Validation
  MolProbity score                2.23
  Clashscore                      13.09
  Poor rotamers (%)               0.07
Ramachandran plot
  Favored (%)                     91
  Allowed (%)                     9
  Disallowed (%)                  0

Microtubules are dynamic polymers of α- and β-tubulin and have crucial roles in cell
signalling, cell migration, intracellular transport and chromosome segregation. They
assemble de novo from αβ-tubulin dimers in an essential process termed microtubule
nucleation. Complexes that contain the protein γ-tubulin serve as structural
templates for the microtubule nucleation reaction. In vertebrates, microtubules are
nucleated by the 2.2-megadalton γ-tubulin ring complex (γ-TuRC), which comprises
γ-tubulin, five related γ-tubulin complex proteins (GCP2–GCP6) and additional
factors. GCP6 is unique among the GCP proteins because it carries an extended
insertion domain of unknown function. Our understanding of microtubule formation
in cells and tissues is limited by a lack of high-resolution structural information on the
γ-TuRC. Here we present the cryo-electron microscopy structure of γ-TuRC from
Xenopus laevis at 4.8 Å global resolution, and identify a 14-spoked arrangement of GCP
proteins and γ-tubulins in a partially flexible open left-handed spiral with a uniform
sequence of GCP variants. By forming specific interactions with other GCP proteins,
the GCP6-specific insertion domain acts as a scaffold for the assembly of the γ-TuRC.
Unexpectedly, we identify actin as a bona fide structural component of the γ-TuRC
with functional relevance in microtubule nucleation. The spiral geometry of γ-TuRC is
suboptimal for microtubule nucleation and a controlled conformational
rearrangement of the γ-TuRC is required for its activation. Collectively, our
cryo-electron microscopy reconstructions provide detailed insights into the molecular
organization, assembly and activation mechanism of vertebrate γ-TuRC, and will serve
as a framework for the mechanistic understanding of fundamental biological
processes associated with microtubule nucleation, such as meiotic and mitotic
spindle formation and centriole biogenesis.


To understand the structural basis of microtubule (MT) nucleation in
vertebrates, we purified γ-TuRCs from X. laevis meiotic egg extract by affinity
chromatography, and confirmed the integrity of the complexes by MT
nucleation analysis, immunoblotting, sucrose gradient density
sedimentation and negative-stain electron microscopy (Extended Data Fig. 1). To
gain insights into the relative abundance of GCP variants in the complex,
we analysed the purified γ-TuRC by label-free quantification (LFQ) mass
spectrometry and determined the stoichiometry of the components
normalized to a 14-spoked γ-TuRC. The purified γ-TuRC contains five copies
of GCP2, five copies of GCP3, two or three copies of GCP4 and one copy
each of GCP5 and GCP6, resulting in a 5:5:2/3:1:1 stoichiometry (Figs. 1a, 2a).
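
The normalization to a 14-spoked complex amounts to rescaling the relative GCP abundances so that they sum to 14. A minimal sketch follows; the input intensities are invented placeholders chosen for round numbers, not the LFQ values measured here, and the function name is illustrative.

```python
from typing import Dict


def normalise_to_spokes(abundance: Dict[str, float], total_spokes: int = 14) -> Dict[str, float]:
    """Rescale relative abundances so that they sum to `total_spokes`."""
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("abundances must sum to a positive number")
    return {name: total_spokes * value / total for name, value in abundance.items()}


if __name__ == "__main__":
    # Placeholder intensities in arbitrary units (hypothetical, for illustration only).
    placeholder = {"GCP2": 10.0, "GCP3": 10.0, "GCP4": 4.0, "GCP5": 2.0, "GCP6": 2.0}
    for name, copies in normalise_to_spokes(placeholder).items():
        print(name, copies)
```

Real LFQ data give non-integer copy numbers after this step, which is why the GCP4 count is reported as two or three.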


Molecular architecture of the γ-TuRC


Using cryo-electron microscopy (cryo-EM) single-particle analysis,
we obtained reconstructions of the purified γ-TuRC (Extended Data
Fig. 2a–e). Local resolution assessment for initial reconstructions
indicated conformational flexibility (Extended Data Fig. 2f), which
we compensated for computationally by splitting the γ-TuRC into
segments that were refined independently (Extended Data Fig. 2e). This
approach resulted in a final reconstruction at 4.8 Å global resolution,
ranging from 4.5 Å to 6 Å locally (Extended Data Fig. 2g, h). The
reconstruction (Fig. 1a) indicates that the γ-TuRC consists of 14 structurally
similar spokes arranged in an open left-handed spiral with a diameter
of approximately 32 nm and a height of around 25 nm. Each spoke
consists of one GCP and one γ-tubulin bound to the conserved GCP GRIP2
domain (Fig. 1b). Resolved features distinguish the structurally closely
related GCPs in the different positions and indicate that the GCP
variants assemble in a defined order. Notably, the N-terminal extensions
present in all GCPs except GCP4 (Extended Data Fig. 1a) were mostly
not resolved, which suggests a high degree of conformational
flexibility for these regions.

For an initial investigation of the γ-TuRC architecture, we
structurally grouped the 14 GCP–γ-tubulin spokes by computing pairwise
cross-correlation coefficients between density segments for each
spoke (Fig. 2b, Extended Data Fig. 3a) and by computing the pairwise
root mean square deviation (r.m.s.d.) values of atomic models for
human GCP4 (Protein Data Bank (PDB) code 3RIP) and γ-tubulin
(PDB code 1Z5W) docked into all 14 spokes (Fig. 2b, Extended Data
Fig. 3b). To account for the different conformational states of GCPs,
the GCP4 model was split into three individual segments (Methods),
which were then docked independently. Both approaches clustered
the spokes into 5 groups with 5:5:2:1:1 stoichiometry, namely: (i)
spokes 1, 3, 5, 7 and 13; (ii) spokes 2, 4, 6, 8 and 14; (iii) spokes 9 and
11; (iv) spoke 10; and (v) spoke 12. In combination with LFQ mass
spectrometry (Fig. 2a), this indicates that the most abundant GCP
groups (i) and (ii) correspond to GCP2 and GCP3. Most prominently,

Fig. 1 | Cryo-EM structure of the X. laevis
γ-TuRC. a, Reconstruction of the γ-TuRC,
filtered according to local resolution. For each
spoke, the consecutive numbering is given.
b, General layout of a γ-TuRC spoke. Atomic
models for γ-tubulin (grey) and GCP2
(rainbow-coloured from the N (blue) to the C (red)
terminus) superposed on the cryo-EM density.
c, Detailed view of molecular components
involved in binding of actin.

GCPs in these two groups were structurally distinguished by the
length of two α-helices in the GRIP2 domain (Fig. 2c, Extended Data
Fig. 3c). These two helices were predicted to be considerably longer
in GCP3 than in all other GCP variants (Extended Data Fig. 4a), which
allows unambiguous assignment of group (ii) to GCP3 and conversely
indicates that GCPs in group (i) correspond to GCP2. This could be
confirmed by a GCP2-specific extended loop (Extended Data Fig. 4b)
between two GRIP2 β-strands visible exclusively in the density for
group (i) spokes (Fig. 2c, Extended Data Fig. 3c). Structural
clustering in conjunction with LFQ mass spectrometry also suggested
that group (iii) with spokes at positions 9 and 11 corresponded to
GCP4. GCP4 is the only GCP variant without an N-terminal
extension (Extended Data Fig. 1a), and only group (iii) spokes had no
continuous density extending from the very N-terminal helix (Fig. 2c,
Extended Data Fig. 3d). Although the two remaining spokes (spokes


Fig. 2 | Structural clustering of γ-TuRC spokes.
a, Relative abundance of γ-TuRC components as
determined by LFQ mass spectrometry. Total
number of GCP proteins is normalized to 14, and
the abundance of each protein is calculated
accordingly. n = 3 biologically independent
experiments; data are mean ± s.d. b, Left, pairwise
cross-correlation coefficients between density
segments. Extended Data Figure 3a provides
correlation values. Right, pairwise r.m.s.d. values
between Cα atoms of rigid-body-fitted atomic
models (Methods) representing the individual
spokes. Extended Data Figure 3b provides r.m.s.d.
values. c, Cluster-specific structural features.
Left, GCP3-specific extended C-terminal
α-helices (red model and density) are unique for
group (ii), and a GCP2-specific extended loop
between the GRIP2 β-strands (blue model and
density) is present only in group (i). Right, only
group (iii) spokes are devoid of continuous
density (yellow) connecting to the N-terminal
helix (red) of the fitted GCP4 model. See Extended
Data Fig. 3c, d for features for all spokes.

Fig. 3 | The GCP6 insertion domain assists in a
stepwise assembly process of the γ-TuRC. a, Protein
constructs for immunoprecipitation. IDo, insertion
domain. b, Co-immunoprecipitation (IP) of
3xFlag-GCP6 insertion domain with the 3xMyc-tagged N
terminus of GCP proteins. 3xMyc-eGFP was used as a
negative control. The indicated antibodies were used
for immunoblotting. Result is representative of three
neighbour along the helical sequence. Colours are as
in f.

10 and 12) lacked distinguishing structural features of the GCP core
segments, LFQ mass spectrometry (Fig. 2a) indicated that they
correspond to the single copies of GCP5 and GCP6. Notably, we could
not trace the GCP5 and GCP6 insertion domains (Extended Data
Fig. 1a) in the cryo-EM reconstruction, which suggests that at least
the segments directly associated with the GCP core fold are not
well-ordered.
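
Grouping spokes by pairwise r.m.s.d., as described above, can be illustrated with a small sketch. It assumes that each spoke is represented by an equal-length list of Cα coordinates already placed in a common frame, and it groups spokes by single-linkage clustering under a distance cutoff. The clustering rule, the cutoff and the toy coordinates are assumptions made for illustration; the text does not specify how the groups were derived from the r.m.s.d. matrix.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root mean square deviation between two equal-length coordinate sets."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def cluster_by_rmsd(models: Sequence[Sequence[Coord]], cutoff: float) -> List[List[int]]:
    """Single-linkage grouping: models closer than `cutoff` share a group."""
    parent = list(range(len(models)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            if rmsd(models[i], models[j]) <= cutoff:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(models)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Three toy "spokes" of two atoms each (hypothetical coordinates in Å).
    spoke_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    spoke_b = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
    spoke_c = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0)]
    print(cluster_by_rmsd([spoke_a, spoke_b, spoke_c], cutoff=1.0))
```

Indices in the output are zero-based positions in the input list, so spoke numbers are obtained by adding one.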

We next generated homology models for the core segments of X.
laevis GCP2–GCP6 and X. laevis γ-tubulin using X-ray structures of
human GCP4 and γ-tubulin, respectively, and refined them against
the cryo-EM density (Methods). The resulting atomic model had good
statistics (Extended Data Table 1) and was validated against our
cryo-EM density (Extended Data Fig. 2h). Consistent with the determined
local resolution (Extended Data Fig. 2g), bulky amino acid side chains
were resolved in many areas of the density and corresponded with
the refined homology models (Extended Data Fig. 5, Methods).
GCP-variant-specific bulky amino acid side chains confirmed our
previous assignment of GCP identities for GCP2–GCP4 and also allowed
identification of GCP5 (spoke 10) and GCP6 (spoke 12) in the cryo-EM
density, thus establishing an atomic model for the structural core of the
vertebrate γ-TuRC with a uniform sequence of GCPs:
GCP(2-3)₄-GCP4-GCP5-GCP4-GCP6-GCP(2-3) (Fig. 1a). This order was unexpected
because GCP4, GCP5 and GCP6 had previously been proposed to cap
the γ-TuRC spiral.
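
The spoke order stated above can be written out position by position and checked against the 5:5:2:1:1 stoichiometry and the spoke groups from structural clustering. The sketch below does only this bookkeeping; it takes four GCP2-GCP3 pairs at spokes 1 to 8, as implied by the group (i) and (ii) spoke numbers, and the function name is illustrative.

```python
from collections import Counter
from typing import List


def spoke_order() -> List[str]:
    """GCP identity at spokes 1 to 14, in order along the spiral."""
    return (
        ["GCP2", "GCP3"] * 4                  # spokes 1-8: four GCP2-GCP3 pairs
        + ["GCP4", "GCP5", "GCP4", "GCP6"]    # spokes 9-12
        + ["GCP2", "GCP3"]                    # spokes 13-14: terminal pair
    )


if __name__ == "__main__":
    order = spoke_order()
    for number, gcp in enumerate(order, start=1):
        print(number, gcp)
    print(Counter(order))
```

Written this way, the lone GCP5 and GCP6 spokes sit inside the spiral, flanked by GCP4 and by the terminal GCP2-GCP3 pair, rather than at its end.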




Actin is an integral part of the γ-TuRC


Several density segments remained unexplained (Fig. 1a), and the most
prominent was a belt of density lining the interior of the γ-TuRC that
was already visible in the negatively stained γ-TuRC particles (Extended
Data Fig. 1h, i). For assignment of these density segments, we pursued
an unbiased structure-guided approach (Methods), which we verified
by localizing γ-tubulin and GCP4 in the γ-TuRC as positive controls
(Extended Data Fig. 6a) and ovalbumin as a negative control (Extended
Data Fig. 6b). We were unable to localize established γ-TuRC
components, including NEDD1 and NME7 that were identified by LFQ mass
spectrometry (Fig. 2a), and MOZART1 (Extended Data Fig. 6c), which
suggests that they are not covered in our cryo-EM reconstruction and
probably associate with the unresolved GCP N termini. Many other
proteins found to be abundant in the purified γ-TuRC could also not
be localized (Methods). Unexpectedly, actin, also identified as a
putative γ-TuRC component (Fig. 2a), produced an unambiguous fit in a
globular domain of the belt density (Fig. 1c, Extended Data Fig. 6d),
where it bridges an α-helical belt domain with the γ-tubulin of spoke 2
(Fig. 1c). Actin binds to γ-tubulin via its D-loop, a canonical interaction
also involved in the formation of actin filaments and DNase I binding.
We confirmed the presence of actin by immunoblotting (Extended
Data Fig. 6e) and showed that it colocalized with GCP6 by
immunofluorescence microscopy (Extended Data Fig. 6f, g). Centrosomes have
Fig. 4 | Geometrical and conformational analysis of the γ-TuRC. a, Residues
distributed along each spoke (coloured circles) were used to define centroids
of the approximate helical axis (brown). An MT cross-section is shown for
comparison. b, Spokewise elevation along the helical axis (incremental pitch)
for the γ-TuRC, the closed γ-TuSC spiral (PDB code 5FLZ) and a 13-spoked MT
(PDB code 6EWO). c, Distance of γ-tubulins from the helical axis, plotted for the
same complexes as in b. d, Inclinations between GRIP1 and GRIP2 axes, as


previously been proposed to nucleate not only MTs but also actin filaments’
and we therefore tested whether γ-TuRC-associated actin could
be involved. Although the ARP2/3-VCA complex used as control showed
robust actin filament nucleation activity in vitro, the purified γ-TuRC
was inactive (Extended Data Fig. 6h). We next used an inhibition experiment
to test whether γ-TuRC-associated actin could have a role in MT
nucleation. The actin-binding protein DNase I binds the actin D-loop
with high affinity, and thus competes with actin binding to the γ-tubulin
of spoke 2°. DNase I treatment of purified γ-TuRC significantly inhibited
its MT nucleation activity in vitro, and pre-incubation of DNase I
with actin abolished this effect (Extended Data Fig. 6i, j). Thus, actin
is a bona fide structural component of the γ-TuRC, and has functional
relevance in MT nucleation.


GCP6 assists in assembly of the γ-TuRC


Candidates for representing the remaining mostly α-helical belt density
are the insertion domains of GCP5 and GCP6. The GCP5 insertion
Supplementary Videos 1 and 2 for visualization of the conformational change.


domain is comparably short (120 residues) and predicted to be mostly
unordered (Extended Data Fig. 4c). By contrast, the 750-residue-long
GCP6 insertion domain contains a segment of 249 residues predicted
to be highly α-helical (approximately 70%) (Extended Data Fig. 4d),
which we confirmed by circular dichroism spectroscopy (Extended
Data Fig. 4e). Furthermore, the length of this GCP6 segment is in good
agreement with the size of the α-helical belt domain (Methods), rendering
the GCP6 insertion domain a genuine candidate for representing
this part of the cryo-EM density.

Our cryo-EM density suggests a direct interaction between the
α-helical belt domain and the N-terminal region of various GCP proteins.
Following this observation, we analysed whether the human
GCP6 insertion domain (residues 606-1499) (Fig. 3a) has the ability
to interact with N-terminal domains of human GCP proteins. Indeed,
the N-terminal domains of GCP2 and GCP5 (GCP2-N and GCP5-N,
respectively) were robustly co-immunoprecipitated with the GCP6
insertion domain (Fig. 3b), whereas GCP3-N showed only weak binding
(Fig. 3b, asterisk). Enhanced green fluorescent protein (eGFP) that was


used as control and GCP4-N were not detected. To narrow down which
regions of the GCP6 insertion domain interact with GCP-N termini,
we divided the human GCP6 insertion domain into three subdomains
(Fig. 3a). GCP2-N and GCP5-N interacted specifically with the mostly
conserved α-helical GCP6(606-1026) fragment, denoted part 1 of the
GCP6 insertion domain, which consequently could mediate specific
recruitment of GCP2 and GCP5 to the γ-TuRC (Extended Data Figs.
7a, 8). Mutations that disrupt the α-helical structure (Extended Data
Fig. 8) strongly reduced binding to GCP2-N and GCP5-N (Extended
Data Fig. 7b), which validates the role of the GCP6 insertion domain in
GCP recruitment. In contrast to the α-helical region, the nine repeats
in the GCP6 insertion domain did not interact with GCP2 and GCP5
(Extended Data Fig. 7a). Differential extraction of GCP variants from
the purified γ-TuRC by salt treatment suggests a stable core of GCP4,
GCP5 and GCP6 that is resistant to harsh salt treatment (Fig. 3c). Mild
salt treatment depleted γ-TuRC of peripheral GCP subunits (most likely
GCP2-GCP3 complexes), as indicated by negative-stain 2D classes
(Fig. 3d). These data suggest a stepwise assembly mechanism in which
the GCP6 insertion domain mediates specific recruitment of one
pre-assembled GCP4-GCP5 dimer” to a GCP4-GCP6 core”, before the
binding of preformed GCP2-GCP3 complexes (Fig. 3e). The presence
of preformed GCP2-GCP3 complexes is supported not only by their
concomitant loss after a salt wash, which always resulted in even-spoked
γ-TuRCs (Fig. 3d), but also by the observation of a pairwise pattern
in the distances of GCP and γ-tubulin molecules, indicating tighter
interaction within the GCP2-GCP3 dimers compared with interdimer
interactions (Fig. 3f, g).


Conformational activation of the y-TuRC 


In vertebrates, the y-TuRC is assembled in a state with only basal MT 
nucleation activity and activators are required to stimulate MT nuclea- 
tion”>, To gain insights into the structural basis of y-TuRC activation, 
we explored whether the geometry of the γ-TuRC in our cryo-EM reconstruction
would be compatible with templating MT nucleation. We
approximated the γ-TuRC helical axis (Fig. 4a) and, for each spoke,
determined the incremental γ-tubulin elevation along the axis (Fig. 4b)
and the helix radius (Fig. 4c). We observed no strict helical symmetry
for either parameter, which indicates that the purified γ-TuRC was
geometrically incompatible with being a structural template for a
13-spoked MT (Fig. 4b, c). To understand the structural basis for the
observed deviations in helical symmetry, we analysed the intrinsic
conformational arrangement of GRIP1 and GRIP2 domains for each
individual spoke. The domain arrangement was similar for all copies
of GCP2, GCP3 and GCP5, but differed for GCP4 and GCP6 (Fig. 4d).
Here, the GRIP domains were arranged in a more-stretched conformation,
resulting in position-specific displacement of the γ-tubulins that
could contribute to the observed distortion of helical symmetry. We
extrapolated the structure of an MT-nucleation-competent γ-TuRC
based on the previously described Saccharomyces cerevisiae γ-TuSC
spiral known to have high MT nucleation activity, and visualized the
required global conformational rearrangement, mostly representing a
contraction of the γ-TuRC spiral that is accompanied by repositioning
of γ-tubulins to achieve a uniform spacing that reflects MT symmetry
(Fig. 4e, Supplementary Videos 1, 2). Such structural rearrangements
could occur spontaneously during MT nucleation, explaining the basal
nucleation activity of purified γ-TuRC in vitro (Extended Data Fig. 1c), or
they could be induced by activators of MT nucleation such as CEP215”.
To test the latter hypothesis, we added the purified CEP215 N terminus
(CEP215-N) including the activating CM1 motif in large excess to purified
γ-TuRC, and analysed the structural and functional properties of
the complexes using negative-stain electron microscopy and in vitro


MT nucleation assays (Extended Data Fig. 9a, b). Although CEP215-N
bound to γ-TuRC in vitro (Extended Data Fig. 9c), we did not observe
structural changes or a strong increase in MT nucleation activity, which
suggests the requirement of other factors that potentially act together
with CEP215-N. Consistently, the addition of recombinant CEP215-N
to egg extract stimulated MT nucleation only in combination with
Ran(Q69L) (loaded with GTP), a GTP-hydrolysis-defective mutant of
the GTPase Ran (Extended Data Fig. 9d). Notably, the N terminus of
the CEP215(F75A) mutant” that was less efficient in γ-TuRC binding
(Extended Data Fig. 9c) could not stimulate MT nucleation activity in
egg extract (Extended Data Fig. 9d).
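The geometric analysis described above (a helical axis, the spokewise elevation along it and the distance of each γ-tubulin from it) can be illustrated with a minimal sketch. The function names, the synthetic helix and all numerical values below are illustrative assumptions, not the procedure or coordinates used in this study; the axis is taken as given rather than fitted.

```python
import math


def helical_parameters(points, axis_origin, axis_direction):
    """Return (elevations, increments, radii) of 3D points relative to an axis.

    elevations: signed position of each point along the axis
    increments: difference in elevation between consecutive points
    radii:      perpendicular distance of each point from the axis
    """
    norm = math.sqrt(sum(c * c for c in axis_direction))
    unit = [c / norm for c in axis_direction]
    elevations, radii = [], []
    for p in points:
        rel = [p[i] - axis_origin[i] for i in range(3)]
        height = sum(rel[i] * unit[i] for i in range(3))
        perp = [rel[i] - height * unit[i] for i in range(3)]
        elevations.append(height)
        radii.append(math.sqrt(sum(c * c for c in perp)))
    increments = [b - a for a, b in zip(elevations, elevations[1:])]
    return elevations, increments, radii


def ideal_helix(n_spokes=14, per_turn=13, rise=1.0, radius=10.0):
    """Hypothetical, perfectly regular helix used only as a reference case."""
    return [
        (
            radius * math.cos(2 * math.pi * k / per_turn),
            radius * math.sin(2 * math.pi * k / per_turn),
            rise * k,
        )
        for k in range(n_spokes)
    ]


points = ideal_helix()
elevations, increments, radii = helical_parameters(
    points, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
)
```

For a strictly helical arrangement, every entry of `increments` and of `radii` is constant; spoke-dependent deviations in either list correspond to the kind of asymmetry reported in Fig. 4b, c.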

Collectively, these biochemical and structural data fundamentally 
deepen our understanding of MT nucleation by providing detailed 
insights into the molecular organization, assembly and activation 
mechanism of vertebrate y-TuRC. Our study provides a rationale for 
the evolutionary acquisition of the GCP4-GCP6 variants in vertebrates. 
In contrast to budding yeast, in which the MT nucleating template 
is assembled only at the yeast centrosome for local MT nucleation", 
our data suggest that the acquisition of GCP4 and GCP6, both contributing
to the asymmetry of γ-TuRC, enables context-independent
pre-assembly of inactive γ-TuRC that can be efficiently activated via a
conformational change. This allows much faster and more dynamic
regulation of MT nucleation within the assembling mitotic spindle, 
in which hundreds to thousands of MTs have to be nucleated within 
minutes”. 
Methods


Data reporting 

No statistical methods were used to predetermine sample size. The 
experiments were not randomized and investigators were not blinded 
to allocation during experiments and outcome assessment. 


Plasmid construction 

PCR amplifications were performed with Q5 High Fidelity DNA Polymerase
(NEB). Fragments were inserted into the backbones with NEBuilder
HiFi DNA Assembly Cloning Kit (NEB). To generate 3xFlag-tagged GCP6 
constructs, corresponding fragments for the GCP6 insertion domain 
(residues 606-1499, denoted GCP6-IDo), the 4P mutant of the GCP6 
insertion domain (denoted GCP6-IDo^4P), the GCP6 insertion domain
part 1 (residues 606-1026, denoted IDo-P1), the 4P mutant of the GCP6 
insertion domain part 1 (denoted IDo-P1^4P), the GCP6 insertion domain
9 repeats (residues 1027-1268, denoted IDo-9 repeats) and the GCP6 
insertion domain part 2 (residues 1269-1499, denoted IDo-P2) were 
amplified and inserted into BamHI-digested pRetroX-TRE3G vector 
and sub-cloned into pCMV-3Tag-1A vector via BamHI. To generate 
3xMyc-tagged GCP6 constructs, eGFP, GCP2-N (1-1485 bp), GCP3-N 
(1-1620 bp), GCP4-N (1-1008 bp) and GCP5-N (1-2142 bp) were amplified
and inserted into the BamHI site of pCMV-3Tag-2B. An additional
base pair ‘A’ was introduced upstream to avoid frameshift and the 
BamHI site was regenerated downstream. The coding sequence for 
residues 546-794 of X. laevis GCP6 (xGCP6(546-794)) was amplified 
and inserted into a BamHI-digested pGEX-6P-1 vector to generate pGEX- 
6P1-xGCP6(546-794). Subsequently, the plasmid was digested with 
EcoRI/NotI and a StrepII tag was introduced after the coding sequence
of xGCP6(546-794). 


Antibodies 

Anti-y-tubulin rabbit polyclonal antibody, which was used for y-TuRC 
purification and the CEP215-N pull-down assay, was generated against 
the C-terminal peptide’. Anti-y-tubulin mouse monoclonal antibody 
(GTU-88), which was used for immunoblotting, was from Sigma- 
Aldrich. Rabbit anti-GCP2 polyclonal antibody was from Thermo 
Fisher Scientific. Rabbit anti-GCP3 polyclonal antibody and rabbit 
anti-GCP6 polyclonal antibodies were from Y. Zheng. Guinea-pig anti- 
GCP6 polyclonal antibody for immunofluorescence was generated 
as previously described”. Anti-β-actin mouse monoclonal antibody
(AC-74) used in immunofluorescence and anti-actin rabbit polyclonal 
antibody (A2066) used in immunoblotting were from Sigma-Aldrich. 
Anti-GCP4 rabbit polyclonal antibodies were raised against full-length 
purified GCP4. Anti-GCP5 mouse monoclonal antibody (E-1) was from 
Santa Cruz Biotechnology. Mouse anti-Flag monoclonal antibody (9A3) 
and rabbit anti-GAPDH polyclonal antibody (14C10) were from Cell 
Signaling Technology. Mouse monoclonal anti-Myc antibody (clone 
9E10) was from Sigma-Aldrich. Secondary antibodies were: donkey 
anti-mouse Alexa Fluor 488-conjugated antibody and goat anti-guinea 
pig Alexa Fluor 555-conjugated antibody (Thermo Fisher Scientific); 
peroxidase-conjugated goat anti-mouse antibody (Jackson ImmunoRe- 
search Laboratories); donkey anti-mouse DyLight 680 and 800-con- 
jugated antibodies (Thermo Fisher Scientific); anti-rabbit DyLight 
680-conjugated antibody (Cell Signaling Technology); and IRDye 800CW
Donkey anti-Rabbit IgG (LI-COR Biosciences).


γ-TuRC purification

CSF-arrested X. laevis egg extracts were prepared as previously
described” and stored at -80 °C. To purify the γ-TuRCs, an adequate
amount of egg extract was defrosted and incubated with γ-tubulin
antibody-crosslinked Dynabeads Protein A for 30 min at room temperature.
The beads were subsequently washed: (1) 3 times with CSF-XB buffer
(5 mM EGTA, 10 mM HEPES pH 7.7, 2 mM MgCl₂, 50 mM sucrose, 100 mM
KCl, 0.1 mM CaCl₂); (2) 3 times with CSF-XB buffer supplemented with
250 mM KCl and 0.3% Triton X-100; and (3) twice with HB100 buffer
(50 mM Na-HEPES pH 8.0, 1 mM EGTA, 1 mM MgCl₂, 100 mM NaCl) containing
0.1 mM GTP. Elution was performed overnight at 4 °C with gentle
rotation in HB100 buffer containing 0.1 mg ml⁻¹ γ-tubulin antigenic
C-terminal peptide’, 1 mM GTP and 0.02% Tween20. The concentration
of the purified γ-TuRCs was determined as 5 nM by immunoblotting
in comparison with purified human γ-tubulin.


γ-TuRC salt treatments

Both mild and harsh washing were performed with the same protocol
as γ-TuRC purification except for washing step (2). For the mild washing,
washing step (2) was carried out with CSF-XB buffer supplemented
with 500 mM KCl and 0.3% Triton X-100 (3 times). For harsh washing,
washing step (2) was carried out sequentially with CSF-XB buffer containing
0.3% Triton X-100 and 4 different concentrations of additional
KCl: 250 mM (4×), 500 mM (1×), 750 mM (1×) and 1 M (1×). The beads were
washed with each buffer for 15-20 min (at 4 °C with gentle rotation)
and the proteins were eluted as described in the γ-TuRC purification
protocol. The band intensity was quantified with the software Image
Studio Lite (v.5.2.5, LI-COR Biosciences) using the Analysis function.


In-gel tryptic digestion, LC-MS/MS analysis and database search 
Samples were separated by SDS-PAGE (minigel, 10%; Invitrogen) for 
1.0 cm. Coomassie-stained lanes were cut out with a scalpel and pro- 
cessed as previously described". In brief, samples were reduced with 
dithiothreitol (DTT), alkylated with iodoacetamide and digested with 
trypsin. Peptides were extracted from gel pieces, concentrated in a
SpeedVac vacuum centrifuge and dissolved with 15 µl 0.1% trifluoroacetic
acid. Nanoflow LC-MS2 analysis was performed with an Ultimate
3000 liquid chromatography system coupled to an Orbitrap Elite mass
spectrometer (Thermo-Fisher). Five microlitres of sample was injected
to a self-packed analytical column (75 µm × 200 mm; ReproSil Pur 120
C18-AQ; Dr. Maisch) and eluted with a flow rate of 300 nl min⁻¹ in an
acetonitrile gradient (3-40%). One survey scan (resolution: 60,000)
was followed by 15 information-dependent product ion scans in the ion
trap. The MaxQuant software (1.6.2.6a)"’ was used with default settings 
for database search against an X. laevis database downloaded from UniProt.org
(Proteome ID: UP000186698 with 42,878 entries; last modified
October 2018) together with the contaminants database included in
the MaxQuant software. In addition, a custom-made database (Supplementary
Table 2) was used containing additional database entries for
γ-tubulin complex components not identical with the UniProt entries.
Trypsin was specified as enzyme. Carbamidomethyl was set as fixed 
modification of cysteine and oxidation (methionine), deamidation 
(asparagine and glutamine) and N-terminal acetylation as variable
modifications. A false-discovery rate of 1% was used on peptide and 
protein levels. To calculate iBAQ values”°, an additional MaxQuant 
analysis was run with iBAQ calculation enabled. An Andromeda score
threshold of 40 was set for unmodified peptides to avoid false positives 
for quantification. For calculation of stoichiometries, the calculated 
iBAQ values of the database entries Q5PQ98 (γ-tubulin), O73787 (GCP3),
A0A1L8HOR3 (GCP4), A0A1L8HGZS5 (GCP5), A0A1L8GZ56 (GCP6),
A0A1L8GY92 (NEDD1), A0A1L8EXC8 (actin) and A0A1L8HF79 (NME7)
from UniProtKB and XP_018080016.1 (GCP2) from the NCBI database
were used. To account for different degrees of purity among different 
preparations of y-TuRC, data were normalized to the iBAQ values of 
GCP2 and GCP3. Source data for LFQ mass spectrometry are included 
in Supplementary Table 1. 
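As an illustration of the normalization step described above, the following minimal sketch expresses each protein's iBAQ value relative to the GCP2 and GCP3 reference. It assumes that the reference is the mean of the two values, which the text does not specify, and the iBAQ numbers are hypothetical placeholders rather than measured data.

```python
def relative_stoichiometry(ibaq, reference=("GCP2", "GCP3")):
    """Scale iBAQ values by the mean iBAQ of the reference proteins.

    Assumption: 'normalized to the iBAQ values of GCP2 and GCP3' is taken
    here to mean division by their arithmetic mean.
    """
    ref = sum(ibaq[name] for name in reference) / len(reference)
    return {name: value / ref for name, value in ibaq.items()}


# Hypothetical iBAQ intensities, for illustration only.
example_ibaq = {"GCP2": 4.0e8, "GCP3": 6.0e8, "GCP4": 2.5e8, "actin": 1.0e8}
stoichiometry = relative_stoichiometry(example_ibaq)
```

Dividing by a shared reference makes preparations of different purity comparable, because a uniform change in the amount of γ-TuRC recovered cancels out in the ratio.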


Immunoprecipitation 

Mycoplasma-free HEK293T cells were bought from ATCC (American Type
Culture Collection) and grown in Gibco DMEM/F-12 medium (Thermo
Fisher Scientific) supplemented with 10% fetal bovine serum, 2 mM
L-glutamine, 100 U ml⁻¹ penicillin and 100 µg ml⁻¹ streptomycin at
37 °C with 5% CO₂. Plasmids were transiently transfected at 40-50%
cell confluency in 10-cm dishes with polyethylenimine according to the
standard protocol. Then, 18-36 h after transfection, cells were scraped
and lysed with lysis buffer containing 10 mM Tris-HCl, pH 7.5, 150 mM
NaCl, 0.5% NP40, 0.5 mM EDTA, 1 tablet per 50 ml protease inhibitor
cocktail (cOmplete, EDTA-free, Roche) and 1:500 Benzonase (Merck
Millipore). After pipetting 20 times and passing through a 21G × 1½″
(0.8 × 40 mm) needle 20 times, the mixture was placed on ice for 10 min
and clarified by spinning for 20 min at 20,000g and 4 °C. The supernatant
was collected and incubated with 25-30 µl anti-Flag M2 affinity gel
(Sigma-Aldrich, equilibrated with the lysis buffer) for 2 h at 4 °C on a
rotating wheel. The beads were collected by centrifuging at 6,000g,
4 °C for 30 s. After washing once with lysis buffer and three times with
PBS, the proteins on the beads were eluted by incubating with 2x sample
buffer at 65 °C for 15 min.


Silver staining 

Polyacrylamide 10-12% gels were used in this study. For silver staining, 
the gel was incubated for 30 min in fixing solution containing 40% 
methanol and 10% acetic acid. Excess acetic acid was washed away with 
30% methanol followed by three washes with water. The gel was sensitized
for 2 min in 0.02% freshly prepared Na₂S₂O₃ and then washed with
water. The staining was performed with 0.2% AgNO₃ at 4 °C for 25 min,
and residual AgNO₃ was washed away with water. The development was
performed with 6% Na₂CO₃, 0.05% formaldehyde and 0.0004% Na₂S₂O₃,
and the reaction was terminated with 1.4% EDTA when the staining was 
sufficient. Major bands from the gel in Extended Data Fig. 1d were cut 
out and verified by mass spectrometry. 


In vitro MT nucleation assay

Unlabelled and Cy3-labelled pig brain tubulin were mixed 24:1 in 1x
BRB80 buffer (80 mM PIPES/KOH pH 6.8, 1 mM MgCl₂ and 1 mM EGTA)
with a final glycerol concentration of 12.5% (w/v). After spinning at
352,860g, 4 °C for 5 min with S100-AT3 rotor (Thermo Fisher Scientific),
2.5 µl supernatant and 2.5 µl γ-TuRC mixture (0.2 µl γ-TuRC, 1 mM GTP,
12.5% (w/v) glycerol in 1x BRB80 buffer) were used, incubated for 15-30
min on ice and transferred to a 37-°C water bath for 5 min for MT nucleation.
The sample was immediately mixed with 50 µl 1% glutaraldehyde
in 1x BRB80 buffer and incubated for 5 min at room temperature for
crosslinking. Then, 1 ml cold 1x BRB80 buffer was added to stop the
reaction and 50 µl of sample was mounted onto a 2 ml 10% glycerol,
BRB80 cushion in a Corex 15-ml glass tube with a 12-mm coverslip supported
by a glass platform at the bottom. The MTs were sedimented
onto poly-lysine-coated coverslips by centrifuging with an HB4/HB6
rotor (Thermo Fisher Scientific) at 23,530g for 1 h at 20 °C. After fixing
for 5 min in ice-cold methanol, the sample was mounted on a 2 µl drop
of Citifluor AF1 (Electron Microscopy Sciences) and sealed with nail
polish. For each sample, 20 random images were acquired using an
Axiovert 200 M microscope (Carl Zeiss Microscopy) equipped with a
Plan-Apochromat 63x NA 1.3 oil objective lens (Carl Zeiss Microscopy),
and a Cascade 1K EMCCD camera (Photometrics). Imaging was operated
with VisiView and the images were processed with ImageJ.


DNase I inhibition assay

Purified γ-TuRC was incubated with a 100-fold excess of recombinant
DNase I (Sigma-Aldrich, D5319) for 3 h on ice. As a control, DNase I and
G-actin were pre-incubated in a 1:1 ratio for 1 h on ice, and the mixture
was subsequently added to the purified γ-TuRC. The protocol for the
in vitro MT nucleation assay was performed as described above except:
after the addition of 1 ml cold 1x BRB80 buffer to stop the reaction,
3 µl of sample was squash-fixed on a slide with a 12 × 12-mm² coverslip.
The imaging protocol was as described above and the MT number was
counted manually from 20 random fields. For each sample, five random
images were compiled into a stack and the maximum projection was
presented in Extended Data Fig. 9b. This protocol is the same for the
in vitro MT nucleation assay of γ-TuRC with CEP215-N.
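The maximum projection mentioned above takes, for every pixel position, the brightest value found across the images in the stack. The following minimal sketch shows the operation on plain nested lists; the function name and the tiny example images are illustrative assumptions and do not reproduce the ImageJ workflow used in the study.

```python
def maximum_projection(stack):
    """Return the pixelwise maximum over a list of equally sized 2D images."""
    rows, cols = len(stack[0]), len(stack[0][0])
    for image in stack:
        if len(image) != rows or any(len(row) != cols for row in image):
            raise ValueError("all images in the stack must have the same shape")
    return [
        [max(image[r][c] for image in stack) for c in range(cols)]
        for r in range(rows)
    ]


# Three hypothetical 2 x 2 images standing in for the five fields per sample.
example_stack = [
    [[1, 5], [0, 2]],
    [[3, 4], [7, 1]],
    [[2, 2], [2, 9]],
]
projection = maximum_projection(example_stack)
```

Because each output pixel keeps only the brightest input, filaments from all fields appear in one image, which suits display but not counting; counts in the assay were made separately from 20 random fields.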


Sucrose gradient 

Sucrose gradients were made by a Model 106 Gradient Master (BioComp
Instruments) according to standard instructions. Purified γ-TuRC was
loaded onto a 2.2-ml 5-40% gradient in HB100 buffer with 0.1 mM GTP.
Centrifugation was performed in PA 7/16 X 2-3/8 tubes (Beckman Coulter)
in the S55-S Swinging-Bucket Rotor (Thermo Fisher Scientific) at
200,000g for 3 h at 4 °C. The fractions were collected from the top (160
µl per fraction). In total, 13 fractions were collected and analysed by
immunoblotting. γ-TuRC was found around fractions 8-11. Thyroglobulin
19S (660 kDa) was used as a molecular mass marker.


Protein purification 

For expression of the N-terminal 249 amino acids of the xGCP6-IDo,
pGEX-6P1-xGCP6(546-794)-Strep II was transformed into Escherichia
coli strain BL21-CodonPlus(DE3)-RIL competent cells. The transformed
cells were cultured in 2xYT medium, and protein expression
was induced with 0.2 mM isopropyl β-D-1-thiogalactopyranoside (IPTG)
at an optical density of 0.5-0.8. After induction for 5 h at 25 °C, cells
were collected, washed with PBS and stored at -80 °C. The pellets were
resuspended in lysis buffer containing 50 mM Tris, 150 mM NaCl, 10%
glycerol, pH 8, 1 mM EDTA, 1 mM DTT, 5 mM ATP, 2 mM MgCl₂, 0.1% Triton
X-100, cOmplete EDTA-free protease inhibitor cocktail (Roche) and
PMSF. Afterwards, the cells were lysed by sonication. The cell lysate
was clarified by centrifuging at 45,000 rpm for 25 min with Type 50.2
Ti Rotor (Beckman Coulter). The supernatant was incubated with the
pre-equilibrated Protino Glutathione Agarose 4B (MACHEREY-NAGEL),
and the protein was eluted by 3C protease cleavage. The eluates were
bound to StrepTactin Sepharose High Performance (GE Healthcare)
and eluted with 2.5 mM desthiobiotin.

For purification of GST, GST-CEP215-N and GST-CEP215(F75A)-N, the 
protocol is as described for xGCP6(546-794) except for the induction and
elution steps. Protein expression was induced overnight at 18 °C, and 
after the purification washing steps, proteins were eluted with 30 mM 
reduced L-glutathione (Sigma-Aldrich) in 50 mM Tris, 150 mM NaCl, 
10% glycerol, pH 8 and 1mM EDTA. Eluted proteins were further puri- 
fied using a Mono Q 5/50 GL anion exchange column (GE Healthcare) 
to remove the reduced L-glutathione. 


Circular dichroism spectroscopy 

Purified xGCP6(546-794) was dialysed against 50 mM Na₂HPO₄-NaH₂PO₄
and 150 mM NaF, pH 8 overnight and concentrated on a Vivaspin
6 centrifugal concentrator (Sartorius) to a final concentration of 2.6 µM.
The circular dichroism spectra were recorded on a Jasco J-715 spectropolarimeter
at 25 °C. A cell with a 1-mm path length was used for spectra
recorded between 190 and 250 nm with sampling points every 0.2 nm.
For each measurement, the spectra represented the average of five
scans and circular dichroism intensities were expressed in millidegrees.
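The sampling scheme and scan averaging described above can be made concrete with a short sketch. A range of 190 to 250 nm sampled every 0.2 nm gives 301 points per scan; the helper names and the example scan values are illustrative assumptions and are not part of the instrument software.

```python
def wavelength_axis(start=190.0, stop=250.0, points_per_nm=5):
    """Wavelengths in nm from start to stop inclusive (0.2-nm steps by default)."""
    n_points = int(round((stop - start) * points_per_nm)) + 1
    return [start + i / points_per_nm for i in range(n_points)]


def average_scans(scans):
    """Pointwise mean of repeated scans (each a list of millidegree values)."""
    n_scans = len(scans)
    return [sum(values) / n_scans for values in zip(*scans)]


wavelengths = wavelength_axis()
# Two hypothetical, very short scans; a real measurement would use five
# scans of len(wavelengths) points each.
mean_spectrum = average_scans([[1.0, 2.0], [3.0, 6.0]])
```

Averaging repeated scans reduces random noise in the recorded ellipticity without changing the position of spectral features.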


CEP215-N pull-down assay 

Purified GST, GST-CEP215-N and GST-CEP215(F75A)-N were conjugated
with Protino Glutathione Agarose 4B (MACHEREY-NAGEL) and
pre-equilibrated with HB100 buffer with 0.1 mM GTP. Seven and a half
microlitres of purified γ-TuRC was incubated with the conjugated beads
in a volume of 90 µl for 2 h at 4 °C. After washing 4 times with washing
buffer (HB100 buffer with 0.1 mM GTP and 0.01% NP-40), bound proteins
were eluted with elution buffer (HB100 buffer with 0.1 mM GTP,
0.01% NP-40 and 30 mM reduced GSH) at 4 °C for 30 min.


CEP215-N activity assay in egg extract 

Freshly prepared CSF-arrested X. laevis egg extracts were mixed with 1
µM Cy3-labelled pig brain tubulin, 15 µM Ran(Q69L), 3 µM GST-CEP215-N
or GST-CEP215(F75A)-N and incubated on ice for 1 h. After incubating in
a 20-°C water bath for 15 min, 2 µl of sample was squashed on a slide with
a 12 × 12-mm² coverslip to test for aster formation. Ten random images
were acquired for each sample using an Axiovert 200 M microscope
(Carl Zeiss Microscopy) equipped with a 10x objective lens (Carl Zeiss
Microscopy), and a Cascade 1K EMCCD camera (Photometrics). Imaging
was operated with VisiView. The images were processed with ImageJ
and the fluorescence intensity of the asters in 10 images was quantified.


Immunofluorescence microscopy 

Purified γ-TuRCs from X. laevis were spun on 12-mm coverslips coated
with poly-L-lysine (Sigma-Aldrich) at 23,530g, 4 °C for 30 min with an
HB6 rotor (Thermo Fisher Scientific). After fixation with ice-cold methanol
for 5 min at -20 °C, samples were incubated with PBS containing 10%
FBS and 0.1% Triton X-100 for 30 min at room temperature and subsequently
treated with 1% SDS in PBS for 5 min at room temperature. The
PBS-washed coverslips were sequentially incubated at room temperature
with primary antibodies for 1 h and Alexa-Fluor-conjugated secondary
antibodies for 30 min. A DeltaVision RT system (Applied Precision,
Olympus IX71 based) equipped with the Photometrics CoolSnap HQ
camera (Roper Scientific), a 100x/1.4 NA UPlanSAPO objective (Olympus),
a mercury arc light source and the softWoRx software (Applied
Precision) were used for imaging. Selected channels, FITC and TRITC,
with different exposure times according to the fluorescence intensity 
of each protein, were applied. Images were acquired by softWoRx soft- 
ware (Applied Precision) and analysed by Fiji. Each group included at 
least 15 images in each independent experiment. For quantification, 
one randomly selected image was used as a reference for adjusting the 
brightness and contrast levels to optimal values and all images were 
analysed using the same settings. GCP6 and actin signals from images 
in each experiment were counted and used for quantification. 


Actin polymerization assay 

Pyrene-labelled actin (Hypermol) was dissolved in general actin buffer
(5 mM Tris, pH 8.0, 0.2 mM CaCl₂, 0.2 mM ATP and 0.5 mM DTT) to a final
concentration of 2 µM. The reaction contained 20 µl pyrene-labelled
actin, actin polymerization buffer (final concentration with 10 mM Tris,
pH 7.5, 50 mM KCl, 2 mM MgCl₂ and 1 mM ATP) and either 2 µl purified
γ-TuRC (final concentration 0.5 nM) or 2 µl γ-TuRC elution buffer. The
controls contained pyrene-labelled actin, actin polymerization buffer,
γ-TuRC elution buffer, Arp2/3 (0.5 nM final concentration, Hypermol)
and VCA (15 nM, Hypermol). Fluorescence was measured at 25 °C using
a plate reader every 1 min for a period of 100 min (CLARIOstar, BMG
Labtech, excitation, F: 360-10, emission, F: 450-10).


Negative-staining electron microscopy and image analysis 
Negatively stained samples were generated as follows: 5 µl of the purified
γ-TuRCs were applied on a glow-discharged copper-palladium
hexagonal 400 EM mesh grid covered with an approximately 10-nm-thick
continuous carbon layer. After a 30 s incubation, the sample was
blotted on a Whatman filter paper 50 (1450-070) and washed with three
drops of water. Samples on grids were stained with 3% uranyl acetate in
water. Images were acquired on an FEI Tecnai F20 electron microscope,
operated at 200 kV, equipped with a field emission gun and bottom-mounted
4 K camera. The micrographs were acquired at 50,000x magnification
by the SerialEM software, resulting in 2.27 Å per pixel. For 2D
classification, γ-TuRC particles were selected manually using the Boxer
of EMAN2”. Image processing was carried out using the IMAGIC-4D
package. Particles were band-pass-filtered and normalized in their
grey value distribution, and mass-centred. Two-dimensional alignment,
classification and iterative refinement of class averages were performed
as previously described”*. Approximate number of total particles for
2D classification and averaging: Fig. 3d, ctrl, 5,400; Fig. 3d, mild wash,
3,100; both Extended Data Fig. 1h, i, 2,200. For Fig. 3d and Extended
Data Fig. 1h, a 46.5-nm mask was used for averaging; for Extended Data
Fig. 1i, a 14.5-nm mask was used for averaging. Particles included in
the representative classes of Fig. 3d: 14 spokes, left, 115, right, 139; 12
spokes, left, 67, right, 59; 10 spokes, left, 76, right, 69.


For γ-TuRCs incubated with or without CEP215-N, the negative-staining
sample preparation and imaging were done as described above.
For 3D classification and 3D reconstruction of γ-TuRC with or without
CEP215-N, particles were manually located on the micrographs as
described above. Micrographs and particle coordinates were subsequently
imported into Relion 3.0 beta and particles were subjected to
3D classification into three classes using a scaled density of the γ-TuRC
cryo-EM reconstruction as initial reference. For the γ-TuRC without
CEP215-N, we retained all 2,205 manually selected particles and subjected
them to 3D autorefinement. For the γ-TuRC with CEP215-N, we
retained 1,581 of the 2,057 manually selected particles for 3D autorefinement.
To compare the overall structure and conformation of the
two resulting 3D reconstructions, we superposed them according to
the first four spokes using the Fit in map command in UCSF Chimera.
To compare the structure of the CEP215-N-γ-TuRC density with the
extrapolated structure of the active γ-TuRC, we simulated the density
at 30 Å resolution from the atomic model in UCSF Chimera and superposed
it to the negative stain 3D reconstruction of the CEP215-N-γ-TuRC
complex as described above.


Cryo-EM sample preparation 

Homemade graphene oxide holey carbon grids (Cu R2/1; 300 mesh)
were glow-discharged using a Gatan Solarus 950 plasma cleaner for
20 s. Cryo-EM grids were prepared using a Thermo Fisher/FEI Vitrobot
Mark IV operated at 22-25 °C and 60-70% humidity. Four microlitres of
purified γ-TuRCs was applied to the EM grids within the climate chamber
of the Vitrobot. After a waiting time of 30 s, grids were blotted with
Whatman filter paper no. 1 for 5 or 10 s and plunge-frozen in a liquid
ethane bath cooled by liquid nitrogen.


Cryo-EM data acquisition 

Four datasets were acquired using SerialEM on a Titan Krios TEM
(Thermo Fisher/FEI) operated at 300 kV and equipped with a K3 camera
(Gatan) operated in dose fractionation mode. Datasets 1–3 were
collected at an object pixel size of 2.1 Å per pixel with cumulative doses
of 35 e⁻ Å⁻² (70 frames), 42 e⁻ Å⁻² (60 frames) and 51 e⁻ Å⁻² (46 frames),
respectively. Dataset 4 was collected at an object pixel size of 1.35 Å
per pixel with a cumulative dose of 57 e⁻ Å⁻² (39 frames). For each
preselected hole, defocus was adjusted automatically before acquisition of
two (datasets 1–3) or four (dataset 4) frame stacks per hole. Data were
collected in a defocus range of −0.5 to −3.5 µm.
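
The acquisition settings above can be cross-checked with simple arithmetic. The following minimal sketch derives the average dose per frame and the electrons per detector pixel from the reported cumulative doses, frame counts and pixel sizes. It assumes that the dose was spread evenly over the frames; the names `DATASETS`, `dose_per_frame` and `dose_per_pixel` are hypothetical helpers introduced here for illustration and are not part of SerialEM.

```python
# Reported acquisition parameters: pixel size (A), cumulative dose (e-/A^2), frames.
DATASETS = {
    1: {"pixel_size_A": 2.1, "dose": 35.0, "frames": 70},
    2: {"pixel_size_A": 2.1, "dose": 42.0, "frames": 60},
    3: {"pixel_size_A": 2.1, "dose": 51.0, "frames": 46},
    4: {"pixel_size_A": 1.35, "dose": 57.0, "frames": 39},
}


def dose_per_frame(dose, frames):
    """Average dose per frame (e-/A^2), assuming an even split over frames."""
    return dose / frames


def dose_per_pixel(dose, pixel_size_A):
    """Electrons per pixel over the whole exposure (dose times pixel area)."""
    return dose * pixel_size_A ** 2


def summarize(datasets=DATASETS):
    """Return {dataset: (dose per frame, electrons per pixel)}."""
    return {
        key: (
            dose_per_frame(d["dose"], d["frames"]),
            dose_per_pixel(d["dose"], d["pixel_size_A"]),
        )
        for key, d in datasets.items()
    }
```

Under the even-split assumption, dataset 1 corresponds to 0.5 e⁻ Å⁻² per frame.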


Data processing 

All image processing steps are summarized in Extended Data Fig. 2.
Unless stated otherwise, image processing was carried out in RELION
3.0-beta. Datasets 1–3 were initially processed separately, but
following an identical workflow, which is described below. The individual
numbers of frame stacks and particles retained after each processing step,
as well as the resolution of intermediary cryo-EM densities, are included in
Extended Data Fig. 2. Frame stacks were motion-corrected with
MotionCor2 using 5 × 5 patches. The contrast transfer function (CTF) of the
motion-corrected micrographs was estimated using Gctf. In total,
3,000 particles were manually selected and used to generate a purely
data-driven initial 3D density of the γ-TuRC in RELION with standard
parameters. Particles were autopicked in RELION using this initial 3D
density, from which reference projections were automatically
generated. Localized particles were extracted at 4.2 Å pixel size in boxes
of 128 × 128 pixels. Particles were subjected to two rounds of 3D
classification with standard parameters to successively remove false-positive
particles, broken γ-TuRCs and classes with strong orientational bias. The
subset of particles retained after the second round of 3D classification
was recentred, extracted at full spatial resolution in boxes of 256 × 256
pixels and subjected to 3D autorefinement using solvent-flattened
Fourier shell correlation (FSC) and otherwise standard parameters. After
refinement, particles were subjected to per-particle CTF refinement
(including beam tilt estimation for individual datasets) and Bayesian
polishing trained on 1,000 particles.
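
The two extraction settings described above cover the same physical area. This minimal sketch checks that, and gives the Nyquist limit for each sampling. It assumes that the 4.2 Å pixel size is a twofold binning of the 2.1 Å data of datasets 1–3; the function names are hypothetical and do not correspond to RELION options.

```python
def box_extent_A(pixel_size_A, box_px):
    """Physical edge length (A) of a square extraction box."""
    return pixel_size_A * box_px


def nyquist_A(pixel_size_A):
    """Finest resolvable spacing (A) at a given sampling: two pixels."""
    return 2.0 * pixel_size_A


def binning_factor(binned_pixel_A, full_pixel_A):
    """Ratio between the binned and the full-resolution pixel size."""
    return binned_pixel_A / full_pixel_A


# Binned extraction for classification versus full-resolution re-extraction.
binned = box_extent_A(4.2, 128)   # 4.2 A per pixel, 128-pixel box
full = box_extent_A(2.1, 256)     # 2.1 A per pixel, 256-pixel box
```

Both boxes span about 538 Å, but the binned data cannot carry information beyond 8.4 Å, which is sufficient for classification and not for the final refinement.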

Dataset 4 was the largest, and autopicked particles were divided
into four different subsets to speed up initial 3D classification steps.
However, following the classification strategy described above, 3D
classes of sufficient quality could only be obtained for two of the four
subsets, as the ratio of true- to false-positive particles after autopicking
was much lower than for datasets 1–3. This was owing to the generally
lower quality of the γ-TuRC purification for this dataset and the smaller
field of view, both resulting in a very low number of particles on each
micrograph. To retrieve particles from the two subsets of data that
could not be classified successfully using the standard approach, we
added particles belonging to the high-quality 3D classes from the other
two subsets as ‘nucleators’ for 3D classes with sufficient quality but
removed them afterwards. Merged particles from all four subsets were
subjected to two additional rounds of 3D classification to retrieve the
final set of particles from dataset 4. These particles were extracted at
2.1 Å pixel size in boxes of 256 × 256 pixels and then processed
identically to datasets 1–3.

A total of 46,096 polished particles from all four datasets were
merged and subjected to 3D autorefinement, which then served as a
basis for several rounds of 3D multibody refinement. Using the
cryo-EM density from 3D autorefinement, we prepared two sets of shape
masks either (i) dividing the γ-TuRCs into seven dimers of successive
spokes, or (ii) splitting the density into four segments, in which two
segments represent the spoke ‘heads’ (C-terminal part of GRIP2 plus
γ-tubulin) for spokes 1–7 and 8–14, respectively, and two segments
represent the spoke ‘base’ (GRIP1 and N-terminal part of GRIP2) for
spokes 1–7 and 8–14, respectively. All masks were generated by
choosing an appropriate density threshold level for each density segment and
extending the density envelope by 5 pixels and a soft edge of 5 pixels.
Both sets of masks were used for a separate round of 3D multibody
refinement. While 3D multibody refinement using the first set of masks
resulted in much improved density for the spoke heads, the second set
of masks yielded optimal refinement for the spoke bases. The resulting
unfiltered density segments were merged into a composite map using
UCSF Chimera. Global resolution of the final density was estimated at
4.9 Å; however, local resolution estimation indicated significantly lower
resolution for dimers 2 (spokes 3/4) and 4 (spokes 7/8). To improve the
density for these map segments, we subjected the particles to another
round of 3D multibody refinement, in which the respective dimers were
split into independently refined monomers. Using this approach, we
achieved improvement of the local resolution for these density
segments, which improved the resolution globally to 4.8 Å.

All resolution estimates were performed according to the gold-standard
FSC criterion of independently refined half maps (FSC = 0.143)
within RELION. Local resolution was estimated using the RELION local
post-processing implementation. The B-factor value used during local
resolution filtering was −300.
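
To illustrate the FSC = 0.143 criterion, the following minimal NumPy sketch computes a Fourier shell correlation curve between two cubic half maps and reads off the resolution at the first shell that falls below the threshold. It is a simplified stand-in, not the RELION implementation: it applies no mask and no mask correction, and the function names are hypothetical.

```python
import numpy as np


def fsc_curve(half1, half2):
    """Fourier shell correlation between two cubic maps of equal size."""
    n = half1.shape[0]
    f1 = np.fft.fftn(half1)
    f2 = np.fft.fftn(half2)
    # Integer shell index (radius in Fourier voxels) of every coefficient.
    freq = np.fft.fftfreq(n) * n
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    shell = np.rint(np.sqrt(kx ** 2 + ky ** 2 + kz ** 2)).astype(int)
    n_shells = n // 2
    keep = shell < n_shells          # ignore corners beyond Nyquist
    idx = shell[keep]
    cross = np.bincount(idx, weights=(f1 * np.conj(f2)).real[keep],
                        minlength=n_shells)
    p1 = np.bincount(idx, weights=(np.abs(f1) ** 2)[keep], minlength=n_shells)
    p2 = np.bincount(idx, weights=(np.abs(f2) ** 2)[keep], minlength=n_shells)
    denom = np.sqrt(p1 * p2)
    # Shells without signal are reported as zero correlation.
    return np.divide(cross, denom, out=np.zeros_like(cross), where=denom > 0)


def resolution_at_threshold(fsc, pixel_size_A, box_px, threshold=0.143):
    """Resolution (A) of the first shell whose FSC is below the threshold."""
    below = np.where(np.asarray(fsc) < threshold)[0]
    if below.size == 0:
        return 2.0 * pixel_size_A    # curve never crosses: report Nyquist
    shell = max(int(below[0]), 1)    # shell 0 has no defined resolution
    return box_px * pixel_size_A / shell
```

A shell index s corresponds to a spatial frequency of s / (box size × pixel size), so the reported resolution is the inverse of that frequency.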


Grouping of γ-TuRC spokes into structural clusters
First, GCPs were clustered according to cross-correlation between
density segments. The γ-TuRC density was segmented into 14 spokes,
comprising one GCP plus one γ-tubulin each. Pairwise cross-correlation
between these segments was then computed in UCSF Chimera using
the Fit in Map command, taking into account all density voxels.
Second, GCPs were clustered according to the r.m.s.d. values of
docked atomic models. The models for human GCP4 (PDB code 3RIP)
and γ-tubulin (PDB code 1Z5W) were docked into all 14 spokes of the
γ-TuRC density. The GCP4 atomic model was split into four rigid bodies
to account for interdomain conformational flexibility: Met1–Lys147,
Ile148–Tyr361, Leu362–Lys505 and Ser506–Gln636. The very C-terminal
helix Ile637–Tyr654 was manually moved into the density for the first
spoke and then kept rigid with the last segment for the other positions.
Docked GCP4 and γ-tubulin models were then combined into one PDB
file for each spoke and pairwise r.m.s.d. values were computed using
the match command in UCSF Chimera. The average r.m.s.d. values
computed between members of a specific cluster were at least two times
smaller than the average r.m.s.d. values computed between members
and nonmembers of a specific cluster (group (i): 0.75 Å versus 1.78 Å;
group (ii): 0.60 Å versus 1.80 Å; group (iii): 1.61 Å versus 3.64 Å).
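
The cluster criterion above (within-cluster r.m.s.d. at least two times smaller than the r.m.s.d. to nonmembers) can be sketched as follows. This is an illustration under stated assumptions, not the Chimera match command: `rmsd` expects coordinate sets that are already superposed, and `intra_inter_means`, `well_separated` and `REPORTED` are hypothetical names that operate on a precomputed pairwise matrix or on the averages quoted in the text.

```python
import numpy as np

# Average r.m.s.d. values (A) quoted above: (within cluster, to nonmembers).
REPORTED = {"i": (0.75, 1.78), "ii": (0.60, 1.80), "iii": (1.61, 3.64)}


def rmsd(a, b):
    """r.m.s.d. between two (N, 3) coordinate arrays that are superposed."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))


def intra_inter_means(dist, labels):
    """Mean pairwise value within each cluster and from it to nonmembers."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    out = {}
    for lab in np.unique(labels):
        inside = labels == lab
        sub = dist[np.ix_(inside, inside)]
        upper = np.triu_indices(sub.shape[0], k=1)   # each pair once
        intra = sub[upper].mean() if upper[0].size else float("nan")
        inter = dist[np.ix_(inside, ~inside)].mean()
        out[lab.item()] = (float(intra), float(inter))
    return out


def well_separated(intra, inter, factor=2.0):
    """True if the nonmember average is at least `factor` times larger."""
    return inter >= factor * intra
```

Applied to the quoted averages, all three groups satisfy the twofold criterion.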


Identification of GCP-variant-specific structural features 
Density segments covered by atomic models for GCP4 and γ-tubulin
(prepared as described above) were annotated using the colour zone
feature in UCSF Chimera. The remaining density segments for all
spokes were compared against each other and correlated with
primary sequence information on the different GCP variants guided by
the docked GCP4 model, as described in the Results section. Secondary
structure prediction was performed in PSIPRED v.3.3 and multiple
sequence alignment was performed in PROMALS.


Atomic modelling 

The crystal structure of human GCP4 (PDB code 3RIP) was used as
template for homology modelling of X. laevis GCPs on the Phyre2 web
portal with one-to-one threading. The following UniProt sequences
were used: XP_018080012.1 (GCP2), O73787 (GCP3), Q642S3 (GCP4),
XP_018102626.1 (GCP5) and Q9DDA7 (GCP6, excluding the insertion
domain sequence Leu532–Arg1260). The crystal structure of human
γ-tubulin (PDB code 1Z5W) was used as template for homology
modelling of X. laevis γ-tubulin with UniProt sequence P23330. The homology
models were subsequently docked into the γ-TuRC density map using
UCSF Chimera. Each model was split into several rigid bodies to account
for interdomain conformational flexibility. The rigid bodies consisted
of: Leu209–Gly364, Tyr365–Glu506, Glu507–Ser668, Ser671–Ser714,
Ala715–Tyr867 for GCP2; Glu246–Gly389, Arg390–Tyr552,
Asn553–Lys691, Gly692–Tyr885 for GCP3; Met1–Lys147, Ile148–Asp349,
Ile350–Lys505, Ser506–Tyr654 for GCP4; Thr259–Pro421, Asp422–Leu723,
Leu724–Asn847 and Lys848–Ala1014 for GCP5; Gln269–Ala421,
Gly422–Leu1015, Lys1016–Ser1464, Asn1465–Tyr1622 for GCP6. The last three
C-terminal helices had to be manually positioned for all homology
models, because their arrangement differed significantly in the X-ray
structure. Missing or incorrectly localized segments of the
homology models were appended or adjusted in Coot, where the density
allowed. This includes Leu215–Cys209 and Thr605–Gly611 in GCP2;
Leu791–Glu815, Val818–Ile839, Glu246–Ser248 in GCP3; Thr259–Gln265
and Arg571–Leu582 after removal of Arg571–Ser670 in GCP5;
Gln269–Asp279 and Ser556–Met561 after removal of Phe532–Leu557 in GCP6.
Unassigned, but clearly resolved, α-helices in the belt density were built
as poly-alanine helices in Coot. In total, 17 helices were built with the
following number of residues: 16, 19, 15, 13, 13, 17, 14, 15, 16, 16, 14, 13,
23, 15, 18, 13 and 31. Parts of the homology models not resolved in the
cryo-EM reconstruction were removed. Rigid-body-fitted homology
models were combined in Chimera into one model and main chain
breaks caused by domainwise rigid-body fitting were repaired in Coot.
The model was refined in real space against the cryo-EM density and
structural clashes were removed using molecular dynamics flexible
fitting (MDFF). MDFF simulations were prepared using QwikMD
and carried out with NAMD using the CHARMM36 force field.
Secondary structure, cis peptide and chirality restraints were used during
800 steps of minimization followed by a 40-ps simulation at 300 K.
The resulting model was refined in Phenix 1.14 using the initial model
as reference and then submitted to the NAMDinator website tool,
which combines molecular dynamics flexible fitting (MDFF) with
real-space refinement in Phenix. MDFF was performed at 298 K using 2,000
minimization steps and 20,000 simulation steps. Simulation was run
in vacuo with a scaling factor of 0.3. Subsequently, the fit of resolved
bulky amino acid side chains was inspected and corrected in Coot, if
required, before the model was refined in Phenix. This procedure was
repeated twice with two slightly different refinement parameters to
obtain the final model. Model validation was performed in Phenix 1.14.
The register of the atomic model was confirmed by the following list of
resolved bulky amino acid side chains (spoke number/residue/unique
as defined in Extended Data Fig. 5): 1/His359/-, 1/Tyr377/-, 1/Tyr455/*,
1/His725/-, 3/Phe275/-, 3/Trp324/*, 3/Tyr455/*, 3/His724/-, 5/Trp324/**,
5/Phe325/**, 5/Arg486/-, 5/Tyr493/-, 5/Phe501/-, 5/His643/-, 5/Tyr646/*,
5/His649/-, 5/His724/-, 7/His289/-, 7/Trp324/*, 7/His359/**, 7/Phe363/**,
7/Tyr455/*, 7/Tyr491/-, 7/His511/-, 7/Trp705/-, 13/Phe275/-, 13/Trp324/*,
13/His359/-, 2/Trp298/-, 2/His386/-, 2/His658/*, 2/Phe665/-, 4/Trp298/-,
4/Phe369/**, 4/Phe371/**, 4/His416/-, 4/Trp667/-, 6/Trp298/-, 6/His342/-,
6/Phe369/-, 6/Trp371/-, 6/Tyr423/*, 6/Phe491/*, 6/Phe529/-, 6/Phe538/-,
6/Phe665/-, 6/Trp680/-, 10/Phe371/*, 10/His409/*, 10/Tyr437/-,
10/Phe458/**, 10/Trp461/**, 10/Tyr703/-, 10/Tyr816/-, 10/Phe820/-,
8/Trp298/-, 8/Tyr306/-, 8/Phe323/-, 8/Tyr334/**, 8/Tyr335/**, 8/His342/-,
8/Phe371/-, 8/His386/-, 8/His416/-, 8/Tyr423/*, 8/Phe529/-, 8/Tyr543/*,
8/His557/-, 8/Tyr719/-, 8/His750/-, 8/His851/-, 14/Trp298/-, 14/Tyr402/-,
14/His493/-, 14/Tyr537/-, 14/Phe538/-, 14/Phe852/-, 9/Lys47/*, 9/Arg54/-,
9/Phe55/**, 9/Phe58/**, 9/Tyr124/-, 9/Phe129/-, 9/Tyr158/-, 9/Tyr184/-,
9/His390/-, 9/His401/-, 9/Phe478/-, 9/His560/-, 9/His562/*, 9/His580/*,
9/His590/*, 11/Phe58/*, 11/His121/*, 11/His130/*, 11/Tyr158/-, 11/Tyr184/-,
11/Trp190/-, 10/Trp272/-, 10/Tyr316/-, 10/Trp362/*, 10/Phe368/-,
10/His925/-, 10/His933/*, 12/His340/-, 12/Tyr367/-, 12/Tyr400/*,
12/Tyr431/**, 12/Tyr433/**.


Unbiased structure-guided identification approach 

For the list of proteins identified in the purified γ-TuRC by LFQ mass
spectrometry, we either downloaded X-ray structures from the PDB
or, where possible, prepared high-confidence homology models in
HHpred. Subsequently, the fitmap command of UCSF Chimera was
used to fit these atomic models into the envelope of the cryo-EM
density, starting from 10,000 randomly sampled starting positions and
orientations for each model. The following parameters were used:
correlation metric with simulated densities at 6 Å resolution, 50 Å radius
around the starting position, global search. For each relevant model,
the number of fits was plotted against the respective cross-correlation
coefficients (Extended Data Fig. 6). Atomic models for NEDD1, NME7
and MOZART1 were limited to protein segments with high homology
to known structures (1–361 for NEDD1; 26–58 for MOZART1; 91–226
and 235–373 for NME7). The following list of proteins could not be
localized in the γ-TuRC density even though highly abundant in the
γ-TuRC purification according to LFQ mass spectrometry (UniProt
code/protein name/template probability from homology
modelling/LFQ intensity): A0A1L8H345/dynamin-1/100/8.72E+09,
Q8AVE2/Hsc70 protein/100/5.89E+09, A0A1L8FKY3/Hsp70/100/5.5E+09,
A0A1L8HW84/tight junction protein ZO-3/100/5.47E+09,
Q7ZTNI1/tight junction protein ZO-3/100/5.42E+09,
A0A1L8GWY3/protein transport protein Sec24a/100/5.26E+09,
A0A1L8FW10/insulin-like growth factor 2 mRNA-binding protein 3/99.95/4.34E+09,
A0A1L8EKZ2/polyadenylate-binding protein/100/4.32E+09,
A0A1L8G1U5/protein transport protein SEC23/100/3.92E+09,
A0A1L8ES55/polyadenylate-binding protein/100/3.72E+09,
A0A1L8H4P1/CSD 1 domain-containing protein/99.35/3.36E+09,
A0A1L8G7U0/protein transport protein SEC23/100/2.75E+09,
A0A1L8GRB6/uncharacterized protein/100/2.47E+09,
A0A1L8GY92/WD_REPEATS_REGION domain-containing protein/99.96/1.77E+09,
A0A1L8HEX9/uncharacterized protein/100/1.67E+09,
A0A1L8FA78/heat shock-related 70 kDa protein; signalling protein/100/1.62E+09,
A0A1L8F1H5/dynamin-1/100/1.47E+09,
A0A1L8HWC1/uncharacterized protein/100/1.45E+09,
A0A1L8GMZ9/dynamin-1/100/1.39E+09,
Q8AVK9/NSEP1 protein/99.33/1.15E+09,
A0A1L8F457/protein transport protein SEC23/100/1.07E+09,
A0A1L8FAZ8/protein transport protein SEC23/100/1.01E+09,
A0A1L8GWQ5/WD repeats region domain-containing protein/100/8.22E+08,
A0A1L8HF79/nucleoside diphosphate kinase 7/99.92/7.27E+08,
A0A1L8G3Y8/fragile X mental retardation syndrome-related protein 1/100/7.13E+08,
A3KMH8/fragile X mental retardation syndrome-related protein 1/100/7.11E+08,
A0A1L8EWC9/interferon-inducible double-stranded RNA-dependent protein kinase
activator A/100/6.98E+08,
A0A1L8FKW5/serine/threonine-protein kinase TOR/100/6.7E+08,
A0A1L8F613/HSPA5 protein/100/6.64E+08,
A0A1L8GQQ7/uncharacterized protein/100/5.84E+08,
A0A1L8HM56/protein transport protein Sec24B/100/5.23E+08,
A0A1L8FTJ1/polyadenylate-binding protein/100/4.81E+08,
A0A1L8FZR3/polyadenylate-binding protein/100/4.81E+08,
A0A1L8EM44/γ-actin/100/4.77E+08,
A0A1L8ETE5/uncharacterized protein/100/4.77E+08,
A0A1L8HRT0/DZF domain-containing protein/100/4.46E+08,
A0A1L8GT63/tight junction protein ZO-1/100/4.42E+08,
Q6GMC1/ubiquitin-40S ribosomal protein S27a/100/4.27E+08,
A0A1L8GS5V1/ubiquitin-like domain-containing protein/100/4.27E+08,
A0A1L8HQK4/polyubiquitin-C/100/4.26E+08,
Q7SY79/Ubc-prov protein/100/4.26E+08,
Q6GQF3/ubiquitin-60S ribosomal protein L40/100/4.26E+08,
A0A1L8HX68/ubiquitin-like domain-containing protein/99.94/4.26E+08,
A0A1L8HCZ9/uncharacterized protein/100/4.26E+08,
A0A1L8H6E1/nucleoside diphosphate kinase 7/99.92/4.13E+08.


Analysis of geometrical parameters 

For analysis of geometrical parameters of the γ-TuRC, we first
computed the (approximate) helical axis for each of the analysed models
by fitting a straight line through the centroids described below. For the
γ-TuRC, we defined centroids for five different sets of atoms along the
spokes (Fig. 4a) using UCSF Chimera. Two points were defined on the
γ-tubulin subunit of each spoke (γ-tubulin Thr145, γ-tubulin Tyr152),
while the other three points were defined based on aligned amino acids
of the GCPs (point 1: Leu554 of GCP3, Leu508 of GCP2, Met590 of GCP6,
Leu350 of GCP4, Leu711 of GCP5; point 2: Pro408 of GCP3, Ser369 of
GCP2, Tyr445 of GCP6, Pro166 of GCP4, Glu451 of GCP5; point 3: Leu249
of GCP3, Leu216 of GCP2, Met1 of GCP4, Val266 of GCP5, Leu280 of
GCP6). For the 13-spoked MT (based on PDB 6EVZ), three centroids
were defined based on Gln15 of three different layers in the MT. For the
yeast γ-TuSC in the closed conformation (PDB 5FLZ), three centroids
were defined based on yeast γ-tubulin Gln12, Asn389 of GCP2/Asn444
of GCP3 and Phel102 of GCP2/Phe216 of GCP3. All three models were
subsequently aligned according to their central axis and coordinates
were transformed such that the axes correspond to the z-axis of the
coordinate system. Cartesian coordinates of tubulins were determined
based on conserved residues (Gln15, Gln16 and His16). On the basis of
these coordinates, we computed the pitch (Fig. 4b) for each spoke and
radial distances (Fig. 4c).
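
The geometric analysis above can be sketched in three steps: fit a straight line through the centroids, rotate the model so that this line coincides with the z-axis, and read off radial distances and the rise along the axis between successive spokes. The NumPy code below is a minimal sketch under stated assumptions. It is not the UCSF Chimera workflow, the exact definition of pitch used for Fig. 4b is not given in the text (the sketch reports the axial rise between consecutive points), and all function names are hypothetical.

```python
import numpy as np


def fit_axis(centroids):
    """Least-squares line through 3D points: returns (centre, unit direction)."""
    pts = np.asarray(centroids, dtype=float)
    centre = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centre)
    direction = vt[0]
    # Fix the sign so the axis runs from the first to the last centroid.
    if np.dot(direction, pts[-1] - pts[0]) < 0:
        direction = -direction
    return centre, direction


def rotation_to_z(direction):
    """Rotation matrix that maps `direction` onto +z (Rodrigues formula)."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    z = np.array([0.0, 0.0, 1.0])
    c = float(np.dot(d, z))
    if np.isclose(c, -1.0):              # antiparallel: flip about the x-axis
        return np.diag([1.0, -1.0, -1.0])
    v = np.cross(d, z)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)


def helical_parameters(coords, centre, direction):
    """Radial distance of each point and axial rise between ordered points."""
    rot = rotation_to_z(direction)
    local = (np.asarray(coords, dtype=float) - centre) @ rot.T
    radial = np.hypot(local[:, 0], local[:, 1])
    rise = np.diff(local[:, 2])
    return radial, rise
```

Here `coords` would hold one conserved-residue position per spoke, ordered by spoke number, and `centroids` the centroids of the atom sets described above.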

Using UCSF Chimera, refined homology models for GCPs and
γ-tubulins (see above) were docked as rigid bodies into the cryo-EM
density of the yeast γ-TuSC oligomer in a closed conformation
(EMD-2799). The atomic models of the X. laevis γ-TuRC in the experimentally
observed conformation and the simulated closed conformation were
aligned with respect to spoke 1 using UCSF Chimera. The
conformational changes linking both models were captured in motion by
interpolating between the two conformations using UCSF Chimera’s morph
conformations function. In an alternative approach to visualize the
conformational change, we used PyMOL to draw vectors between Cα
atoms in the two conformations and colour-coded them according to
their r.m.s.d. values.

To analyse relative GRIP1–GRIP2 domain inclinations, we computed
their axes for each spoke. To define the GRIP1 domain axis, we used
the following corresponding residue pairs: Ala266/Gly453 for GCP2,
Gly297/Gly486 for GCP3, Gly49/Gly280 for GCP4, Gly317/Gly559 for
GCP5 and Gly332/Gly526 for GCP6. To define the GRIP2 domain axis,
we used the following corresponding residue pairs: Leu508/Phe757
for GCP2, Leu554/Phe784 for GCP3, Leu350/Phe592 for GCP4,
Leu711/Phe959 for GCP5 and Met1319/Phe1553 for GCP6. The relative
inclinations of GRIP1 and GRIP2 domains were measured in UCSF Chimera
with the angle command. The GRIP1–GRIP2 inclinations were
computed separately for each spoke and then quantified for each GCP
variant: five copies of GCP2, five copies of GCP3, two copies of GCP4,
one copy of GCP5 and one copy of GCP6. The average angle for each
GCP group was calculated as mean ± s.e.m.
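
The inclination measurement above reduces to the angle between two axes, each defined by a residue pair, followed by a mean ± s.e.m. per GCP variant. The sketch below is an illustration with hypothetical function names, not the UCSF Chimera angle command. It assumes that each axis is given by the coordinates of its two defining residues and that the s.e.m. uses the sample standard deviation; for variants present in a single copy (GCP5 and GCP6) the s.e.m. is undefined and is returned as NaN.

```python
import numpy as np


def axis_angle_deg(a_start, a_end, b_start, b_end):
    """Angle (degrees) between axis a and axis b, each given by two points."""
    u = np.subtract(a_end, a_start).astype(float)
    w = np.subtract(b_end, b_start).astype(float)
    cosang = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
    # Clip guards against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))


def mean_sem(values):
    """Mean and standard error of the mean (sample s.d. / sqrt(n))."""
    v = np.asarray(values, dtype=float)
    if v.size < 2:
        return float(v.mean()), float("nan")
    return float(v.mean()), float(v.std(ddof=1) / np.sqrt(v.size))


def summarize_by_variant(angles_by_variant):
    """Map each GCP variant to (mean, s.e.m.) of its per-spoke angles."""
    return {name: mean_sem(vals) for name, vals in angles_by_variant.items()}
```

With five copies each of GCP2 and GCP3 and two copies of GCP4, only these three groups yield a defined s.e.m.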

To analyse interspoke distances, we computed distances between
conserved residues in the GCP GRIP1 and GRIP2 domains or γ-tubulins
of neighbouring spokes in UCSF Chimera. The following residues were
used: Ala266 of GCP2, Gly297 of GCP3, Gly49 of GCP4, Gly317 of GCP5
and Gly332 of GCP6 for the GRIP1 domain; Phe757 of GCP2, Phe784 of
GCP3, Phe592 of GCP4, Phe959 of GCP5 and Phe1553 of GCP6 for the
GRIP2 domain; Asn187 for γ-tubulin.


Research animals 


The permission numbers for X. laevis experiments are:
35-9185.81/G-204/12 (Regierungspräsidium Karlsruhe, BW); 81-02.05.40.17.091
(LANUF Recklinghausen, NRW). Ethical justification of the experiments
is according to § 7a Abs. 2 Nr. 3 TierSchG. We have complied with all
relevant ethical regulations. The injection of hormones into frogs mim- 
ics the physiological stimulus upon reproduction with a negligible 
burden for animals. In turn, the experiments in X. laevis using laid eggs 
and derived extracts have contributed ground-breaking knowledge to 
basic mechanisms of cell division, cell cycle control and MT functions. 
The latter has led, for example, to the discovery and characterization 
of cytostatic drugs such as taxol (Paclitaxel). These experiments are 
therefore performed with an optimal cost-benefit ratio, and show 
hardly any harm to animals while allowing great potential for gaining 
knowledge for basic and applied science. 


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 
Cryo-EM densities of the γ-TuRC filtered according to global or local
resolution have been deposited in the Electron Microscopy Data Bank
(EMDB) under accession code EMD-10491. Atomic coordinates for the
γ-TuRC have been deposited at the PDB under accession code 6TF9.
The original immunoblots and further source data from LFQ mass
spectrometry (Fig. 2a), immunoblot quantification (Fig. 3c), geometric
analysis of the atomic model (Figs. 3g, 4b–d), MT nucleation assays
(Extended Data Figs. 1c, 6i, 9b, d), circular dichroism measurements
(Extended Data Fig. 4e), unbiased structure-guided identification
(Extended Data Fig. 6a–d), quantification of indirect
immunofluorescence (Extended Data Fig. 6g) and actin polymerization (Extended
Data Fig. 6h) are included in the Supplementary Information. The raw
cryo-EM micrograph movie stacks are available from the
corresponding authors upon request.
Extended Data Fig. 1 | Biochemical, functional and structural
characterization of γ-TuRCs purified from X. laevis egg extracts. a, Domain
organization of GCP2–GCP6. Between the conserved GRIP1 and GRIP2
domains, GCP5 and GCP6 possess a 120- and 750-residue-long insertion
domain, respectively. The GCP6 insertion domain contains 8 repeats of 27
amino acids. Domains are annotated according to the Pfam database.
b, Schematic of γ-TuRC purification. The γ-TuRC was purified with γ-tubulin
antibody-crosslinked Protein A Dynabeads, washed with CSF-XB buffer
containing 250 mM salt (KCl), and then eluted by a short peptide
corresponding to the C terminus of γ-tubulin. c, Purified γ-TuRCs showed basal
MT nucleation activity. Experiment was carried out with (+) or without (−)
purified γ-TuRCs, and 33 µM tubulin (5% Cy3-labelled tubulin for visualization).
In the negative control, the same purification procedure was used with eluates
from rabbit random IgG-crosslinked Protein A Dynabeads. Twenty random
images were acquired with a light microscope, and representative overview
images are shown. Right, the number of MTs was quantified by ImageJ and data
are mean ± s.d. n = 4 biologically independent experiments. P value determined
by unpaired two-sided t-test. Scale bar, 10 µm. d, e, After the affinity
purification of γ-TuRCs, the eluted proteins were resolved by SDS–PAGE
followed by silver staining (d) and immunoblotting (e). Representative images
in d and e are from three biologically independent experiments. For gel source
data, see Supplementary Fig. 1. f, Immunoblotting analysis of the purified
γ-TuRCs after sucrose gradient. Purified γ-TuRCs were applied to a 5–40%
sucrose gradient and fractionated after centrifugation. Fractions were
resolved by SDS–PAGE and probed using γ-tubulin and GCP5 antibodies.
Thyroglobulin (19.4 S) was used as a standard marker and run on a parallel
gradient. Representative images were from three biologically independent
experiments. For gel source data, see Supplementary Fig. 1. g, Confirmation of
structural integrity of the purified γ-TuRCs by negative-staining electron
microscopy. Representative micrograph is from five biologically independent
experiments. Scale bar, 100 nm. Black arrowheads denote examples of
particles used in 2D classification and averaging. h, i, γ-TuRC particles from the
negative-stain electron microscopy were classified and averaged with a mask
size of either 46.5 nm (h) or 14.5 nm (i). Representative classes of γ-TuRCs are
from three biologically independent experiments. The number of particles
contributing to each class is given. An example of the ‘asymmetric’ density
inside the γ-TuRC is highlighted by a white arrow (h), which is more readily
visible with a 14.5-nm mask focusing on the inner part of the γ-TuRC (i). Scale
bars, 20 nm.
Extended Data Fig. 2 | Cryo-EM data processing and resolution estimation.
a, A representative micrograph from four biologically independent
experiments with manually selected particles (green circles) is shown. Three
selected particles are shown below. b, Initial model of the γ-TuRC obtained
from 3,000 manually selected particles (Methods). Scale bars are depicted in a.
c, Four datasets were acquired and initially processed separately. Datasets 1, 2
and 3 were submitted to two consecutive rounds of 3D classification with a
varying number of classes. All class averages obtained for dataset 1 are shown.
Retained class averages are highlighted by a rectangular box. d, Dataset 4 was
the largest dataset, and was therefore divided into four subsets. The
higher-quality subsets were submitted to two rounds of 3D classification and particles
encompassed in the high-quality classes from these two subsets were
combined with all four original subsets of particles to nucleate high-quality
classes in the two lower-quality subsets. Only class averages retained for
further processing are shown. e, The final sets of particles from all four
datasets were refined separately, submitted to CTF refinement and Bayesian
polishing, and subsequently merged. The γ-TuRC density was split into several
combinations of segments, as indicated. All combinations of segments were
subjected to a multibody 3D refinement separately. Their output density
segments were combined into one composite density for analysis. The angular
distribution of particle views is shown. f, Comparison of local resolution
estimation before (left) and after (right) 3D multibody refinement. Local
resolution was markedly improved for the peripheral segments of the γ-TuRC
density reconstruction upon 3D multibody refinement. The colour-coded
resolution scale is identical for both panels. g, Local resolution estimation of
the final γ-TuRC density map with a resolution range covering the entire
spectrum. h, Mask-corrected FSC between the two independently refined
half-set reconstructions (purple), and between the full reconstruction and the
atomic model for the γ-TuRC (green).
Extended Data Fig. 3 | Structural grouping and unique GCP-variant-specific
features identify GCP proteins in the γ-TuRC. a, Pairwise cross-correlation
between isolated density segments, colour-coded from higher (red) to lower
(blue) correlation. The actual correlation values are given. b, Atomic models of
human GCP4 and γ-tubulin were fitted into the 14 γ-TuRC spokes domainwise
(Methods). Pairwise r.m.s.d. between Cα atoms of atomic models representing
the individual spokes, colour-coded from lower (red) to higher (blue) r.m.s.d.
values. The actual r.m.s.d. values are given. Both approaches cluster the spokes
into five classes, colour-coded in the left column and top row. c, d, Atomic
models for human GCP4 (white) were fitted domainwise into the density.
Density segments covered by the atomic model are depicted in transparent
grey. Remaining unexplained segments are depicted in colour. Features are
shown for all 14 individual spokes of the γ-TuRC. c, Characteristic density
segments of the GRIP2 domain. Extended C-terminal α-helices (red model and
density) are unique for group (ii) (spokes 2, 4, 6, 8, 14), and an extended loop
between the GRIP2 β-strands (blue model and density) is present only in group
(i) (spokes 1, 3, 5, 7, 13). d, Unexplained density segments N-terminal of the
GRIP1 domain. Only group (iii) (spokes 9, 11) is devoid of a continuous density
connecting to the N-terminal helix of GCP4 (position 9 and 11; lack of yellow
extension). Colour code as in Fig. 1a.


Extended Data Fig. 4 (image). Panels a–d: sequence alignments and secondary structure predictions for GCP2–GCP6 segments. Panel e: CD spectra, intensity (mdeg) versus wavelength (190–250 nm), for BSA, the xGCP6 fragment and buffer.
Extended Data Fig. 4 | Secondary structure prediction and biochemical
characterization for GCP variant-distinguishing regions and the GCP5 and
GCP6 insertion domains. a, C-terminal segments of all GCP variants were
aligned and the secondary structure was predicted. Confidence for α-helical
secondary structure is colour-coded (blue denotes low confidence, red
denotes high confidence). b, Multiple sequence alignment for the GRIP2
segments encompassing the inter-β-strand loop for all GCP variants. β-strands
are highlighted in blue. c, Secondary structure prediction for the insertion
domain of GCP5. The prediction confidence for α-helical secondary structure is
colour-coded. d, Secondary structure prediction of xGCP6(546–794) showing
highly α-helical character. e, Circular dichroism (CD) analysis of the purified
xGCP6(546–794). Left, representative plots are from three biologically
independent experiments. Comparison of the circular dichroism spectra for
xGCP6(546–794) and BSA (containing only α-helices) confirms the predicted
α-helical character of the N-terminal part of the GCP6 insertion domain. Right,
Coomassie-blue-stained SDS–PAGE gel of the elution fraction of the
xGCP6(546–794) shows the purity of the sample. Representative image is from
three biologically independent experiments.
Extended Data Fig. 5 | Gallery of bulky amino acid side chains resolved in the γ-TuRC density. GCP-variant-specific bulky amino acid side chains are marked by
an asterisk. Combinations of such side chains in close proximity are marked by two asterisks.
Extended Data Fig. 6 | Actin is a bona fide γ-TuRC component involved in MT
nucleation. a–d, High-confidence homology models were docked with 10,000
randomly sampled starting positions and orientations into the envelope of the
cryo-EM density. The number of fits is plotted against the respective
correlation coefficients. a, γ-Tubulin and GCP4 served as positive controls.
True positive fits correspond to redundant high-frequency fits with high
correlation coefficients (dashed box). b, Ovalbumin served as negative control.
No redundant high-frequency fits with high correlation coefficient indicative
of a true positive match were observed. c, Unbiased fitting of NME7, NEDD1 and
MOZART1 resulted in a distribution of correlation coefficients similar to the
negative control, indicating no positive fit. d, Unbiased fitting of actin resulted
in a distribution of correlation coefficients similar to the positive controls,
clearly including a true positive fit (dashed box) shown in Fig. 1a. e, Purified
γ-TuRCs from X. laevis egg extracts were resolved by SDS–PAGE and
immunoblotted with anti-γ-tubulin, anti-GCP5 and anti-actin antibodies, which
confirms that actin is associated with the purified γ-TuRC fraction.
Representative blots are from three biologically independent experiments. For
gel source data, see Supplementary Fig. 1. f, g, Indirect immunofluorescence of
adsorbed γ-TuRC rings with antibodies directed against actin and GCP6
indicates colocalization of both proteins. Treatment with 1% SDS increased the
antibody accessibility of actin located in the spatially confined interior of the
γ-TuRC. f, Representative fluorescence images together with magnified views
from three biologically independent experiments. Scale bars, 20 nm.
g, Percentage of colocalization events of actin and GCP6 normalized to the
GCP6 signal. After treatment with 1% SDS, 40.2% of GCP6 signals colocalized
with actin signals. In the absence of 1% SDS, colocalization events decreased to
6.4% owing to inaccessibility of the epitope. Data are mean ± s.d. from n = 3
biologically independent experiments. P values were determined by unpaired
two-sided t-test. h, Actin polymerization activity was tested for buffer with and
without pyrene F-actin and 0.5 nM ARP2/3, 0.5 nM ARP2/3 with 15 nM VCA and
0.5 nM γ-TuRC with pyrene F-actin. The purified γ-TuRC is devoid of actin
nucleation activity. The fluorescence intensity change was determined over
time. Data are the mean of five independent experiments. i, γ-TuRC MT
nucleation activity after pre-incubation with buffer, the actin-binding protein
DNase I and the preformed actin–DNase I complex. n = 3 biologically
independent experiments; data are mean ± s.e.m. P values were determined by
unpaired two-sided ¢-test.j, After 3 h incubation onice, before the MT 
nucleation assay, samples were analysed by immunoblotting to confirm that 
equal amounts y-TuRC were present in different experimental groups. Three 
biologically independent experiments were performed with similar results. 
For gel source data, see Supplementary Fig. 1. 
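
The legends in this section report replicate data either as mean ± s.d. or as mean ± s.e.m., and compare groups by unpaired two-sided t-test. As a minimal sketch of how these summary quantities relate for n = 3, the following Python snippet computes them for two groups. The replicate values and the function names `summarize` and `pooled_t` are hypothetical, and the equal-variance (Student's) form of the t statistic is an assumption, because the legends do not state which variant was used.

```python
from math import sqrt
from statistics import mean, stdev, variance


def summarize(values):
    """Return (mean, s.d., s.e.m.) for a list of replicate measurements."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)           # sample standard deviation (n - 1 denominator)
    sem = sd / sqrt(n)           # standard error of the mean
    return m, sd, sem


def pooled_t(a, b):
    """Student's t statistic and degrees of freedom for two independent groups.

    Assumes equal variances in both groups (an assumption, not taken from the text).
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical colocalization percentages from three independent experiments
with_sds = [38.0, 40.0, 42.0]
without_sds = [5.0, 6.0, 7.0]

for label, group in (("+SDS", with_sds), ("-SDS", without_sds)):
    m, sd, sem = summarize(group)
    print(f"{label}: mean = {m:.1f}%, s.d. = {sd:.2f}, s.e.m. = {sem:.2f}")

t, df = pooled_t(with_sds, without_sds)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The two-sided P value follows from the t distribution with the stated degrees of freedom; a statistics package would normally supply it. The s.e.m. is always smaller than the s.d. by a factor of √n, which is worth keeping in mind when comparing error bars between panels that use different conventions.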
Extended Data Fig. 7 | Co-immunoprecipitation of wild-type or
mutant 3xFlag-GCP6 insertion domain fragments with the 3xMyc-tagged
GCP2 and GCP5 N termini. a, Co-immunoprecipitation of the GCP6 insertion
domain (IDo; residues 606-1499), part 1 of the GCP6 insertion domain (IDo-P1;
residues 606-1026), GCP6 insertion domain 9 repeats (IDo-9 repeats; residues
1027-1268) and part 2 of the GCP6 insertion domain (IDo-P2; residues 1269-1499)
(as defined in Fig. 3a) with GCP2-N and GCP5-N. Immunoblotting was
performed with indicated antibodies. Representative result is from three
independent experiments. For gel source data, see Supplementary Fig. 1.
b, Co-immunoprecipitation of the GCP6 insertion domain or part 1 of the GCP6
insertion domain with residues V644, F706, F732 and Q783 mutated to proline
(GCP6-IDo(4P) and GCP6-IDo-P1(4P), respectively) with both GCP2-N and GCP5-N.
Immunoblotting was performed with indicated antibodies. Three biologically
independent experiments were performed with similar results. For gel source
data, see Supplementary Fig. 1.
Extended Data Fig. 8 | Sequence conservation of the GCP6 insertion domain
region. The GCP6 insertion domains from human, bovine, mouse, chicken, X.
laevis, Anolis carolinensis and medaka (Oryzias latipes) were aligned with
Clustal Omega (default settings) built into Jalview software. The mutations in
the 4P mutant (V644P, F706P, F732P and Q783P) are indicated by asterisks. The
individual repeats of the nine-repeat structure in human GCP6 are marked by
arrows.


Extended Data Fig. 9 | The CM1 motif of CEP215-N is not sufficient to activate
the MT nucleation activity of γ-TuRC. a, γ-TuRC particles with or without a
600-fold excess of CEP215-N were analysed by negative-staining electron
microscopy. The resulting 3D densities (grey, yellow) were superimposed and
compared to a simulated density of the γ-TuRC in the extrapolated active
conformation (red). Arrows indicate the most pronounced structural
differences (transparent grey density not filled by red density) between the
CEP215-N-γ-TuRC complex and the simulated density of the γ-TuRC in the
extrapolated active conformation. b, γ-TuRC (0.5 nM) was incubated with an
excess of CEP215-N or CEP215(F75A)-N (3 µM). In vitro MT nucleation activity of
the γ-TuRC incubated with buffer, glutathione S-transferase (GST),
GST-CEP215-N or GST-CEP215(F75A)-N. n = 3 biologically independent experiments;
data are mean ± s.d. P values were determined by unpaired two-sided t-test.
Scale bar, 10 µm. c, In vitro binding of γ-TuRC to purified and recombinant GST,
CEP215-N and CEP215(F75A)-N. Immunoblots were probed with anti-γ-tubulin
and anti-GST antibodies as shown. Three biologically independent
experiments were performed with similar results. For gel source data, see
Supplementary Fig. 1. d, MT nucleation activity in egg extracts induced by the
addition of Ran(Q69L) and CEP215-N. The MT nucleation reaction was stopped
after 15 min at 20 °C when Ran(Q69L) addition only induced a small number of
MTs. Fold changes of total aster fluorescence intensity from ten random fields
were quantified and normalized to the group adding Ran(Q69L) and without
CEP215-N or CEP215(F75A)-N. n = 3 biologically independent experiments; data
are mean ± s.e.m. P values were determined by unpaired two-sided t-test. Scale
bars, 50 µm.


Extended Data Table 1 | Cryo-EM data collection, refinement and validation statistics

Data collection and processing
Magnification
Electron exposure (e−/Å²)
Defocus range (µm)
Pixel size (Å)
Symmetry imposed
Initial particle images (no.)
Final particle images (no.)
Merged set of particles
Map resolution (Å)
FSC threshold
Map resolution range (Å)

Refinement
Initial model used (PDB code)
Model resolution (Å)
FSC threshold
Model resolution range (Å)
Map sharpening B factor

Model composition
Non-hydrogen atoms
Protein residues
Protein

For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.


n/a | Confirmed


[x] The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement

[x] A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly

The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.

[x] A description of all covariates tested

[x] A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons

[x] A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

[x] For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.

[x] For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings

[x] For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes

[x] Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection Datasets for cryo-EM were collected with SerialEM (3.7). Data for CD analysis were collected using JASCO spectra manager (1.06.00).
Data for immunoblots were collected either by LAS4000IR (2.1) or Image Studio (5.2). Datasets for negative stain EM were collected with
SerialEM (1.0.0.1). softWoRx (6.1.1) was used to collect the data for the immunofluorescence experiment. Data for the in vitro nucleation
assay and egg extract experiment were collected by VisiView software. Data for the actin polymerization assay were collected by CLARIOstar.


Data analysis EM data were processed using Relion 3.0-Beta, MotionCorr 2.0 and gCTF 1.06. All density-map-related figures were prepared in Chimera
1.13.1 and ChimeraX 0.9. Atomic modelling was done in Coot 0.8.9.2 and Coot 0.9. Molecular flexible fitting was performed in QwikMD
implemented in VMD 1.9.4.a35 and in NAMD 2.13. Refinement and final flexible fitting of the model were performed in the website tool
NAMDinator, and in Phenix 1.14. Data for γ-TuRC geometrical analysis of tubulins and for CD analysis were plotted in Gnuplot 5.2.
Secondary structure predictions were done in the website tool PSIPRED vs 3.0. Sequence alignment of GCP components was done in the website
tool PROMALS-3D. Analysis and vector visualisation of the conformational change of γ-TuRC were performed in PyMol 2.2.3. Data for CD
measurements were analyzed using JASCO spectra manager (1.06.00). MaxQuant software (1.6.2.6a) was used for LC-MS/MS analysis
and database search. Fiji software (2.0.0-rc-46/1.50g) was used to analyze the microtubule nucleation and immunofluorescence data,
and western blot images. Microsoft Office Excel (2011) was used to normalize the data before statistical analyses. GraphPad Prism 6
(6.01) was used for statistical analyses. Image Studio Lite (5.2.5) was used for western blot quantification in salt treatments. EMAN2
Project Manager (eman2.12) was used for negative stain EM particle picking, and the particles were classified and averaged using the
IMAGIC-4D package. The GCP6 IDo domain alignment was performed with Clustal Omega built into Jalview software (2101-VM).


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Policy information about availability of data


All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability


Cryo-EM densities of the γ-TuRC filtered according to global or local resolution were deposited in the Electron Microscopy Data Bank (EMDB) under accession code
EMD-10491. Atomic coordinates for the γ-TuRC were deposited at the Protein Data Bank (PDB) under accession code 6TF9. The original immunoblots and further
source data from LFQ mass spectrometry (Fig. 2a), immunoblot quantification (Fig. 3c), geometric analysis of the atomic model (Figs. 3g, 4b, 4c, 4d), MT nucleation
assays (Extended Data Figs. 1c, 6i, 9b, 9d), CD measurements (Extended Data Fig. 4e), unbiased structure-guided identification (Extended Data Fig. 6a-d),
quantification of indirect immunofluorescence (Extended Data Fig. 6g) and actin polymerization (Extended Data Fig. 6h) are included in the Supplementary
Information. The raw cryo-EM micrograph movie stacks are available from the corresponding authors upon request.


Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


[x | Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences 


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size No statistical methods were used to predetermine sample size. Cryo-EM data were collected in a sufficient amount for reconstruction. For
immunofluorescence and microtubule in vitro nucleation experiments, images were acquired in a number that is sufficient for statistical
analysis (as specified in Methods). For negative stain EM, the images used for particle classification and averaging were enough to get clear
views of the protein complexes. Sample size was chosen based on data variation.


Data exclusions No data were excluded from the analyses. 


Replication All experiments, including the immunoprecipitation, mass spectrometry, and so on, were repeated at least three times to confirm the 
reproducibility. All replicates were successful. 


Randomization For immunofluorescence, mass spec analysis and microtubule in vitro nucleation experiments, images/data were acquired randomly. Other
experiments were not related to randomization.


Blinding Blinding was not applied to our study. For a number of experiments this was technically not possible: cryo-EM, IP experiment, CD
measurements. For other experiments this was technically not feasible because of the large sample size: e.g. MT nucleation assays. In other
cases, a finding was confirmed by an independent approach: e.g. actin in the γ-TuRC.


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems
n/a | Involved in the study
[x] Antibodies
[x] Eukaryotic cell lines
[x] Palaeontology
[x] Animals and other organisms
[x] Human research participants
[x] Clinical data

Methods
n/a | Involved in the study
[x] ChIP-seq
[x] Flow cytometry
[x] MRI-based neuroimaging


Antibodies 


Antibodies used Anti-γ-tubulin rabbit polyclonal antibody, which was used for γ-TuRC purification, was homemade against the C-terminal peptide
as described in the previous publication: doi: 10.1038/ncomms9722, and diluted 1:2000 in immunoblot in CEP215N pull-down
assay. Anti-γ-tubulin mouse monoclonal antibody (GTU-88, T6557, Lot: 053K4839), which was used for immunoblotting, was
from Sigma-Aldrich and was diluted 1:4000. Rabbit anti-GCP2 polyclonal antibody (PA5-21433, Lot: QL2122721A) was from
Thermo Fisher Scientific, and was diluted 1:1000 for immunoblotting. Rabbit anti-Xgrip109 (GCP3) polyclonal antibody, diluted
1:2000 in immunoblotting, was from Dr. Y. Zheng, and was previously described in the publication: DOI: 10.1083/jcb.141.3.675.
Rabbit anti-GCP6 polyclonal antibody, diluted 1:2000 in immunoblotting, was from Dr. Y. Zheng, and was described in the
previous publication: DOI: 10.1083/jcb.151.7.1525. Guinea pig anti-GCP6 polyclonal antibody for immunofluorescence was
generated as described in the previous publication: DOI: 10.1083/jcb.151.7.1525, and was diluted 1:200. Anti-β-Actin mouse
monoclonal antibody (AC-74, A5316, Lot: 048M4843V) used in immunofluorescence (diluted 1:1000) and anti-Actin rabbit
polyclonal antibody (A2066, Lot: 058M4812V) used in immunoblot (diluted 1:200) were from Sigma-Aldrich. Anti-GCP4 rabbit
polyclonal antibody was raised against full-length purified GCP4, and was diluted 1:200 in immunoblotting. Anti-GCP5 mouse
monoclonal antibody (E-1) was from Santa Cruz Biotechnology (sc-365837, Lot: E2014), and was diluted 1:500 in
immunoblotting. Mouse anti-FLAG monoclonal antibody (9A3, 8146S, Lot: 3, diluted 1:500 in immunoblotting) and rabbit
anti-GAPDH polyclonal antibody (14C10, 2118S, Lot: 10, diluted 1:1000 in immunoblotting) were from Cell Signaling Technology.
Mouse monoclonal anti-c-Myc antibody (clone 9E10, M4439, Lot: 087M4765V, diluted 1:1000 in immunoblotting) was from
Sigma-Aldrich. Secondary antibodies used in this study were: Donkey anti-Mouse Alexa Fluor 488-conjugated antibody (A21202,
Lot: 1562298, diluted 1:500 in immunofluorescence) and goat anti-guinea pig Alexa Fluor® 555-conjugated antibody (A21435,
Lot: 1711692, diluted 1:500 in immunofluorescence) from Thermo Fisher Scientific; peroxidase-conjugated goat anti-mouse
antibody (115-035-068, Jackson ImmunoResearch Laboratories, diluted 1:5000 in immunoblotting); donkey anti-mouse DyLight
680 (A10038, Lot: 1717043) and 800-conjugated antibodies (SA5-10172) from Thermo Fisher Scientific, both
diluted 1:5000 in immunoblotting; anti-rabbit DyLight 680-conjugated antibody (5366S, Lot: 3, Cell Signaling Technology, diluted
1:5000 in immunoblotting); IRDye 800CW Donkey anti-Rabbit IgG (926-32213, Lot: C61012-02, LI-COR Biosciences, diluted
1:5000 in immunoblotting).

Validation Homemade rabbit anti-γ-tubulin polyclonal antibody and homemade rabbit anti-GCP4 polyclonal antibodies were
validated previously in the publication: doi: 10.1038/ncomms9722

Homemade rabbit anti-GCP3 polyclonal antibody was validated previously in the publication: DOI: 10.1083/jcb.141.3.675
Homemade rabbit anti-GCP6 polyclonal antibody was validated previously in the publication: DOI: 10.1083/jcb.151.7.1525

Homemade Guinea pig anti-GCP6 polyclonal antibody was generated with the same epitope as homemade rabbit anti-GCP6
polyclonal antibody (DOI: 10.1083/jcb.151.7.1525), and was further validated by pilot experiments before use.

All commercial antibodies were validated by manufacturers for the species and applications, which can be found in the links
below.

Anti-γ-tubulin mouse monoclonal antibody (GTU-88):
https://www.sigmaaldrich.com/catalog/product/sigma/t6557?lang=de&region=DE

Rabbit anti-GCP2 polyclonal antibody: https://www.thermofisher.com/antibody/product/GCP2-Antibody-Polyclonal/PA5-21433

Anti-β-Actin mouse monoclonal antibody (AC-74):
https://www.sigmaaldrich.com/catalog/product/sigma/a2228?lang=de&region=DE

Anti-Actin rabbit polyclonal antibody (A2066):
https://www.sigmaaldrich.com/catalog/product/sigma/a2066?lang=de&region=DE

Anti-GCP5 mouse monoclonal antibody (E-1): https://www.scbt.com/scbt/product/gcp5-antibody-e-1

Mouse anti-FLAG monoclonal antibody (9A3):
https://en.cellsignal.de/products/primary-antibodies/dykddddk-tag-9a3-mouse-mab-binds-to-same-epitope-as-sigma-s-anti-flag-m2-antibody/8146

Rabbit anti-GAPDH polyclonal antibody (14C10):
https://en.cellsignal.de/products/primary-antibodies/gapdh-14c10-rabbit-mab/2118

Mouse monoclonal anti-c-Myc antibody (clone 9E10):
https://www.sigmaaldrich.com/catalog/product/sigma/m4439?lang=de&region=DE&gclid=CjwKCAjwkqPrBRA3EiwAKdtwk-gHHmK21m-8cwvWs0O04fsjAXJMo3waZRhaAnZ8eq5IsEdpYzebNXBoC_qQQAvD_BwE

Donkey anti-Mouse Alexa Fluor 488-conjugated antibody:
https://www.thermofisher.com/antibody/product/Donkey-anti-Mouse-IgG-H-L-Highly-Cross-Adsorbed-Secondary-Antibody-Polyclonal/A-21202

Goat anti-guinea pig Alexa Fluor® 555-conjugated antibody:
https://www.thermofisher.com/antibody/product/Goat-anti-Guinea-Pig-IgG-H-L-Highly-Cross-Adsorbed-Secondary-Antibody-Polyclonal/A-21435

Peroxidase-conjugated goat anti-mouse antibody: https://www.jacksonimmuno.com/catalog/products/115-035-068

Donkey anti-mouse DyLight 680-conjugated antibody:
https://www.thermofisher.com/antibody/product/Donkey-anti-Mouse-IgG-H-L-Highly-Cross-Adsorbed-Secondary-Antibody-Polyclonal/A10038

Donkey anti-mouse DyLight 800-conjugated antibody:
https://www.thermofisher.com/antibody/product/Donkey-anti-Mouse-IgG-H-L-Cross-Adsorbed-Secondary-Antibody-Polyclonal/SA5-10172

Anti-rabbit DyLight 680-conjugated antibody:
https://en.cellsignal.de/products/secondary-antibodies/anti-rabbit-igg-h-l-dylight-680-conjugate/5366

IRDye 800CW Donkey anti-Rabbit IgG:
https://www.licor.com/bio/reagents/irdye-800cw-donkey-anti-rabbit-igg-secondary-antibody


Eukaryotic cell lines 


Policy information about cell lines

Cell line source(s) HEK293T cell line is as described (Panic, M. et al. PLOS Genet. 11, 1005243 (2015)).

Authentication We verified the HEK293T cell line according to the morphology by light microscopy. Cell line was purchased from ATCC. 


Mycoplasma contamination HEK293T cell line was negative in the mycoplasma contamination test. 


Commonly misidentified lines No commonly misidentified lines were used in this study. 
(See ICLAC register) 


Animals and other organisms 


Policy information about studies involving animals; ARRIVE guidelines recommended for reporting animal research 


Laboratory animals Xenopus laevis (lab bred, NASCO, USA), female, sexually mature, i.e. more than 9 months old.

Wild animals No wild animals were used in this study. 

Field-collected samples No field-collected samples were used in this study. 

Ethics oversight The permission numbers for the Xenopus laevis experiments are: 35-9185.81/G-204/12 (Regierungspräsidium Karlsruhe, BW);
81-02.05.40.17.091 (LANUF Recklinghausen, NRW). Ethical justification of the experiments is according to § 7a Abs. 2 Nr. 3 TierSchG.


Note that full information on the approval of the study protocol must also be provided in the manuscript. 




Article 


The structural basis for
cohesin–CTCF-anchored loops


https://doi.org/10.1038/s41586-019-1910-z 


Received: 21 February 2019 


Accepted: 5 December 2019 


Published online: 6 January 2020 


Yan Li, Judith H. I. Haarhuis, Angela Sedeño Cacciatore, Roel Oldenkamp,
Marjon S. van Ruiten, Laureen Willems, Hans Teunissen, Kyle W. Muir, Elzo de Wit,
Benjamin D. Rowland & Daniel Panne


Cohesin catalyses the folding of the genome into loops that are anchored by CTCF.
The molecular mechanism of how cohesin and CTCF structure the 3D genome has
remained unclear. Here we show that a segment within the CTCF N terminus interacts
with the SA2-SCC1 subunits of human cohesin. We report a crystal structure of
SA2-SCC1 in complex with CTCF at a resolution of 2.7 Å, which reveals the molecular basis
of the interaction. We demonstrate that this interaction is specifically required for
CTCF-anchored loops and contributes to the positioning of cohesin at CTCF binding
sites. A similar motif is present in a number of established and newly identified
cohesin ligands, including the cohesin release factor WAPL. Our data suggest that
CTCF enables the formation of chromatin loops by protecting cohesin against loop
release. These results provide fundamental insights into the molecular mechanism
that enables the dynamic regulation of chromatin folding by cohesin and CTCF.


The interphase genome is folded in 3D through the concerted action
of cohesin and CTCF. These architectural factors regulate the interactions
between regulatory elements along chromosomes to control
gene expression. Cohesin is thought to catalyse genome folding
through a process known as ‘loop extrusion’, which involves the formation
of chromosome loops that are progressively enlarged. Genomic
regions within which cohesin forms loops are also known as topologically
associating domains (TADs), or loop domains. TADs are flanked
by CTCF sites that are thought to act as barriers to the loop extrusion
process. CTCF acts as such a boundary only when the 3’ ends of
CTCF binding motifs are oriented towards the inside of the TAD.
Consequently, only convergently oriented pairs of CTCF sites form
CTCF-anchored loops.

This model is supported by genetic manipulation of cohesin and
CTCF. Depletion of the core cohesin subunit SCC1 leads to loss of
TADs. By contrast, depletion of the cohesin release factor WAPL
increases the size of chromatin loops. CTCF depletion leads to
a marked loss of CTCF-anchored loops. However, how CTCF can
act as a directional boundary that controls cohesin loop extrusion
remains unknown.

Here we have investigated the mechanism of cohesin interaction
with CTCF, and how this interaction contributes to genome organization.
We have identified an N-terminal segment of CTCF that directly
engages the SA2-SCC1 subcomplex of cohesin. Our crystal structure
of the SA2-SCC1-CTCF complex elucidates the molecular basis of
the interaction. CTCF-anchored loops are abolished in mutants of key
amino acids in the interface, but the accumulation of cohesin at CTCF
binding sites across the genome is only partially impaired. In addition
to its function as a translocation barrier, CTCF thus possesses a distinct
loop-stabilizing activity, which is realized through a direct interaction
with cohesin. Furthermore, we observe intermolecular competition
between CTCF and the cohesin release factor WAPL for this interface,
which suggests a mechanism by which chromatin loop formation may
be dynamically regulated.


Structure of the SA2-SCC1-CTCF complex 


Previous data indicate that CTCF directly interacts with the SA2 subunit
of the cohesin complex. To map this interaction, we produced a
series of CTCF truncations as proteins fused to glutathione S-transferase
(GST), and performed pulldown assays against a complex of
SA2 and SCC1. CTCF fragments that contained amino acids 227-235
generally retained SA2-SCC1 on GST beads (Extended Data Fig. 1a, b).
Isothermal calorimetry experiments further showed that the interaction
is largely driven by amino acids 222-231 of CTCF, as the interaction
involving this truncated CTCF retained an equilibrium dissociation
constant (Kd = 1.04 ± 0.20 µM) comparable to that of an extended CTCF
construct (Kd = 0.62 ± 0.07 µM) (Extended Data Fig. 1c, Extended Data
Table 1a). To understand the molecular details, we produced crystals
of the SA2-SCC1 complex in the presence of a peptide comprising
the CTCF binding motif, and determined the structure by molecular
replacement at a resolution of 2.7 Å (Extended Data Table 1b). An Fo − Fc
omit electron density Fourier map exhibited clear features that correspond
to the CTCF peptide (Extended Data Fig. 1d).

The CTCF peptide is bound to the convex surface of SA2 (Fig. 1a, b).
The CTCF binding surface is predominantly hydrophobic and composed
of amino acids that are contributed by both SA2 and SCC1. The
lead ‘anchoring’ amino acids of CTCF, which bury the largest
solvent-accessible surface area upon binding, are Y226 and F228 (Fig. 1b). F228
inserts into a pocket comprising amino acids from SCC1 (S334, I337 and
L341) and SA2 (Y297 and W334) (Extended Data Fig. 1e). The hydroxyl
group of Y226 hydrogen-bonds with D326 of SA2 in a deep hydrophobic
pocket lined by L329, L366 and F367 (Fig. 1d). E229 and E230 of CTCF
constitute secondary anchoring residues, which presumably contribute
to binding specificity by forming salt bridges with R298 of SA2 and
R338 of SCC1 (Fig. 1c). As CTCF engages a composite binding surface
containing amino acids from SCC1 and SA2, previous mapping studies
that used isolated SA2 may have been misleading.


Fig. 1 | Structure of the SA2-SCC1-CTCF complex. a, Surface-rendered
cartoon of the SA2-SCC1-CTCF complex, with components coloured in blue,
green and magenta, respectively. C, C terminus; N, N terminus. b, Detailed view
of the binding interface with SA2 residues in blue, SCC1 in green and CTCF in
magenta. c, d, Details of the composite binding pocket around CTCF F228 (c)
and CTCF Y226 (d). e, f, GST pulldown analysis of CTCF (e) and SA2 or SCC1
variants (f). B, bound fraction; I, input; M, molecular weight marker. Controls
are shown in e (lanes 1 and 2). Experiments were done once. g, SA2 is
surface-rendered and coloured according to sequence conservation.


*e-mail: kmuir@mrc-lmb.cam.ac.uk; e.d.wit@nki.nl; b.rowland@nki.nl; daniel.panne@le.ac.uk


472 | Nature | Vol 578 | 20 February 2020


Analysis of the CTCF binding interface 

Mutagenesis of Y226A or F228A in CTCF abolished SA2-SCC1 binding
in a GST pulldown assay (Fig. 1e). Likewise, the substitution of critical
amino acid residues, including W334A, F371A or F367A in SA2 or
I337A/L341A in SCC1, abolished CTCF binding (Fig. 1f). SA2 contains an
86-amino-acid motif termed the ‘stromalin conservative domain’ or
‘conserved essential surface’ (CES), which is conserved from fungi to
mammals and coincides with the CTCF binding pocket. For simplicity,
we refer to the composite SA2-SCC1 binding pocket as the CES. Mapping
of sequence conservation onto the structure confirms that the CES
is highly conserved (Fig. 1g, Extended Data Fig. 2a). A series of missense
mutations are found in SA2 (also known as STAG2), SCC1 (also known as
RAD21) and CTCF in various types of cancer. The mapping of mutation
frequencies onto the structure shows that amino acids that are largely
buried in the interface are hotspots in cancer (Extended Data Fig. 2b).


Fig. 2 | CTCF interaction stabilizes cohesin on DNA. a, Schematic of
competition between CTCF and WAPL. b, Increasing amounts of WAPL residues
1-600 (WAPL(1-600)) (lanes 4-7; molar ratios are indicated) were incubated
with GST-CTCF and SA2-SCC1 and the bound fraction was analysed. Three
independent experiments were done, with consistent results. A representative
example is shown. c, Example images of cells used in d at the indicated time
points after photobleaching. FRAP was performed in G1 cells (Extended Data
Fig. 3d). d, Quantification of the FRAP experiments. Averages and standard
deviations for 21 wild-type cells and 17 CTCF(Y226A/F228A) cells, measured over
3 independent experiments.


Previous data indicate that the SA2-SCC1 complex interacts with
multiple cohesin regulators. This includes two factors with opposing
functions: WAPL, the general cohesin release factor, and shugoshin
(SGO1), a factor that is crucial for the protection of centromeric cohesion
during mitosis. This antagonism arises as a result of direct
competition for binding to the CES of SA2-SCC1. As mutants reported
to interfere with both SGO1 and WAPL binding cluster in the CES, we
investigated whether these proteins bind to SA2-SCC1 by a mechanism
comparable to that of CTCF. In SGO1, the reported CES-binding domain
(amino acids 313-353) contains a conserved FGF-like motif that strongly
resembles that of the CTCF peptide. Vertebrate WAPL also contains
several FGF motifs in its N-terminal region that are potentially involved
in cohesin regulation. A minimal fragment of WAPL capable of competing
with SGO1 for access to the CES (amino acids 410-590) contains
two such FGF motifs. We observed that a peptide that spans the second
and third FGF motif of WAPL (amino acids 423-463) bound to SA2-SCC1
with a Kd of about 32.8 µM (Extended Data Fig. 2c), whereas a peptide
that comprises only the third motif bound more weakly (Extended Data
Table 1a). The peptide containing the CES motif of CTCF therefore binds
with higher affinity than do peptides that contain the WAPL motif(s).


CTCF stabilizes cohesin on chromatin 


The observation that CTCF and WAPL can bind to the same surface on
SA2-SCC1 raises the possibility that their interactions with the CES are
mutually exclusive (Fig. 2a). To determine whether WAPL competes
with CTCF for binding to the CES of SA2-SCC1, we performed GST-pulldown
competition assays. Titration of WAPL residues 1-600 against
a preformed complex of GST-CTCF and SA2-SCC1 depleted the latter
from the beads (Fig. 2b). Similarly, titration of a peptide of SGO1
phosphorylated at T346, which has previously been reported to preclude
WAPL binding, also displaced SA2-SCC1 from GST-CTCF (Extended
Data Fig. 2d). Hence, the CES of SA2-SCC1 is a general interaction hub
for multiple regulators of cohesin (Extended Data Fig. 2e). Whereas
SGO1 precludes WAPL binding (thus stabilizing centromeric cohesin in
mitosis), CTCF could exert a similar function at CTCF sites in interphase.

Fig. 3 | CTCF-CES interaction is required for CTCF-anchored loops. a, Hi-C
contact matrices of the HOXA locus at 10-kb resolution, normalized to
100 million contacts per sample. Genes and CTCF sites are depicted above the
contact matrices. b, Genome-wide quantification of loops using HICCUPS.
The inset shows an example of called loops for a region of chromosome 16.
c, Aggregate peak analysis for the loops defined genome-wide in wild-type
cells. The Hi-C signal is averaged across these locations for both cell lines.

To test whether the CTCF-CES interaction stabilizes cohesin on
chromatin, we mutated the endogenous allele of CTCF in the human
haploid HAP1 cell line using CRISPR-Cas9 technology. We thereby
obtained HAP1 cells that contained CTCF(Y226A/F228A) as their
sole copy of CTCF (Extended Data Fig. 3a, b). These cells displayed no
obvious proliferation defects. To study the consequences of the CTCF
mutations on cohesin turnover on chromatin, we endogenously tagged
the core cohesin subunit SCC1 with a Halo tag in both wild-type and
CTCF(Y226A/F228A) cells (Extended Data Fig. 3c), and performed fluorescence
recovery after photobleaching (FRAP) experiments. In wild-type cells,
we found that a fraction of the fluorescent cohesin population did not
recover over a period of 20 min. However, in CTCF(Y226A/F228A) cells we
observed a near-complete recovery by FRAP, which demonstrates that
cohesin is more mobile in these cells (Fig. 2c, d). The CTCF-CES
interaction therefore stabilizes a subpopulation of cohesin on chromatin.


Loops require CTCF-CES binding 
To investigate the role of the cohesin-CTCF interaction in chromosome
organization, we generated chromosome conformation capture (Hi-C)
profiles of wild-type and CTCF(Y226A/F228A) cells. Wild-type HAP1 cells
displayed clear loops connecting CTCF sites (Fig. 3a, Extended Data
Fig. 4); however, Hi-C matrices of CTCF(Y226A/F228A) cells revealed a robust
ablation of CTCF-anchored loops (Fig. 3a). By systematically scoring
the number of loops, we found that in the CTCF(Y226A/F228A) mutant the vast
majority of detectable loops across the genome were lost (from 2,756
in the wild-type cells to 98 in the mutant cells) (Fig. 3b). An aggregate
peak analysis, which quantifies the contact frequency of all the loops
identified in wild-type cells, likewise showed a marked loss of these
contacts (Fig. 3c, Extended Data Fig. 4d).

CTCF sites not only lie at the bases of CTCF-anchored loops, but also
form the boundaries of TADs (Extended Data Fig. 4a). We then assessed
the effect of the CTCF(Y226A/F228A) mutation on TADs, and found that these
structures were still present to a considerable degree in CTCF(Y226A/F228A)
cells but that they have less-clear edges (Fig. 3a). Aggregate TAD
analysis further confirmed that TAD-like structures do exist in
CTCF(Y226A/F228A) cells, but that these structures have less-clear boundaries
(Extended Data Fig. 4b, c, e, f) and completely lack CTCF loops at their
corners (Extended Data Fig. 4b, e). Our results therefore support the
notion that in CTCF(Y226A/F228A) cells cohesin can form the loops along
DNA that make up the contacts within TADs, but that cohesin is not
stabilized at CTCF sites to allow for the formation or maintenance of
CTCF-anchored loops.

Fig. 4 | CTCF-CES interaction promotes localization of cohesin to CTCF sites.
a, Hi-C contact matrix of a region of chromosome 7 at 10-kb resolution. CTCF
sites are depicted below; those selected for qPCR are shown in colour (forward
motifs in red, reverse motifs in blue). The numbers underneath indicate the loci
used for qPCRs in b. Locus 6 (indicated with *) is the HOTAIRM1 transcription
start site. b, ChIP-qPCR analysis of SCC1 (cohesin) at the loci depicted in a. The
mean of three independent ChIP experiments is shown with standard
deviations. c, ChIP-seq tracks for SCC1 of the same region of chromosome 7 as
is depicted by Hi-C in a. The ChIP-qPCR loci of b are depicted below. d, ChIP-seq
heat map of the cohesin subunit SCC1 (left) and CTCF (right). The depicted
sites are selected for being bound in wild-type cells by both SCC1 and CTCF
(top), or only by SCC1 (bottom). e, Cohesin-mediated looping initiates at distal
sites and proceeds until cohesin encounters the N-terminal end of CTCF.
f, Cohesin-mediated looping starts at CTCF sites. g, Molecular model of CTCF
and SA2-SCC1 bound to DNA (grey). The YXF motif is separated from the
C-terminal DNA-binding domain of CTCF by a flexible linker spanning residues
232-267 (magenta dotted line).


Cohesin localization to CTCF sites 


To assess whether the CTCF-CES interaction affects cohesin abundance
at loop anchors, we performed chromatin immunoprecipitation with
quantitative PCR (ChIP-qPCR) experiments. We selected CTCF sites
at the base of loops (Fig. 4a, Extended Data Fig. 5a) and found that in
the CTCF(Y226A/F228A) mutant the abundance of cohesin was reduced at
the majority of these loci. By contrast, cohesin levels at a nearby locus
that did not contain a CTCF site were not affected (Fig. 4b, Extended
Data Fig. 5b). CTCF binding to the corresponding CTCF sites was also
largely unaffected (Extended Data Fig. 5c-e). We then assessed cohesin
distribution genome-wide by chromatin immunoprecipitation with
sequencing (ChIP-seq) and found that the CTCF(Y226A/F228A) mutation
decreased cohesin localization to CTCF sites, but had little-to-no effect
on cohesin localization at unrelated sites (Fig. 4c, d). Although cohesin
levels at CTCF sites were reduced in CTCF(Y226A/F228A) cells, cohesin was
still present at CTCF sites to a considerable degree. Our data therefore
support a model in which CTCF influences cohesin in two ways: (i) it
halts cohesin at CTCF sites and (ii) it stabilizes cohesin at the base of
CTCF-anchored loops. The former function could be important for
defining TAD boundaries. The binding of CTCF to the CES of cohesin
could affect the latter function and may thereby prevent the disruption
of CTCF-anchored loops.

To evaluate the consequences of the loss of CTCF-anchored loops on
gene expression, we performed RNA-sequencing analyses. The
CTCF(Y226A/F228A) mutation affected the expression of more than 2,000 genes.
Although the number of genes that were upregulated was comparable
to the number of genes that were downregulated, the most strongly
affected genes were more frequently downregulated (Extended Data
Fig. 7a). Thus, the interface of CTCF formed by Y226 and F228 and (by
extension) cohesin-CTCF anchored loops are apparently key to correct
expression of these genes. Despite this effect on gene expression
and the loss of virtually all CTCF-anchored loops, cells that contain
only this mutant form of CTCF are viable. CTCF has previously been
shown to be essential for viability of mouse embryonic stem cells.
We therefore tested whether CTCF is essential for the viability of HAP1
cells, and found that CTCF depletion mediated by short interfering
RNA was lethal to both control HAP1 cells and CTCF(Y226A/F228A) mutant
cells (Extended Data Fig. 7b, c). Thus, CTCF has essential roles that
are apparently independent of CES engagement and the formation of
CTCF-anchored loops in these cells.


Identification of CES ligands 


To investigate the prevalence of the CES-binding factors, we compiled
an alignment of known cohesin partners and derived a regular
expression motif (Extended Data Fig. 2e, f). We used this motif to query
the human and budding yeast proteomes for proteins that contain
similar binding motifs. From the set of nuclear proteins that arose
from this search, we were able to identify known cohesin regulators
as well as several additional potential binding factors. We generated
peptide arrays that bear these sequences and assayed the binding of
SA2-SCC1, using an SA2(F371A)-SCC1 mutant complex as a negative
control. We observed clear signal for the CTCF peptide that spans
amino acids 222-231, which was abolished in the SA2(F371A)-SCC1
mutant (Extended Data Fig. 7d, e). A CTCF(Y226F) mutant showed
approximately 1.5-fold reduced binding, apparently due to loss of the
hydrogen bond between the hydroxyl group of CTCF Y226 and D326
of SA2 (Extended Data Table 2). Consistent with our pulldowns, the
CTCF(Y226A), CTCF(F228A) and CTCF(Y226A/F228A) peptide variants
did not retain SA2-SCC1. The WAPL peptides showed considerably
weaker binding as compared to CTCF, and we could not detect binding
for ligands such as SGO1 (Extended Data Table 1a, Methods). Robust
binding was observed for MCM3 (a subunit of the replicative helicase),
SYCP3 (a component of the synaptonemal complex), ZGPAT (a
transcriptional repressor) and CENPU (a subunit of the inner kinetochore).
Thus, the CES of SA2-SCC1 potentially facilitates cohesin regulation
for a number of functionally divergent chromosomal processes.
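
The motif-based proteome query described above can be illustrated with a minimal Python sketch. The pattern `[FY].F` is a simplified stand-in chosen here for illustration; it is not the regular expression derived in Extended Data Fig. 2f, and the function name `find_motifs` is hypothetical. The example sequence is the CTCF peptide spanning amino acids 222-231 (DVSVYDFEEE) given in the Methods.

```python
import re

# Simplified stand-in for the F/YXF motif (assumption, not the published
# expression): F or Y, then any residue, then F. The lookahead lets
# overlapping hits be reported as well.
MOTIF = re.compile(r"(?=([FY].F))")

def find_motifs(sequence, first_residue=1):
    """Return (residue number, matched tripeptide) for every motif hit.

    first_residue is the residue number of the first character in sequence.
    """
    return [(m.start() + first_residue, m.group(1))
            for m in MOTIF.finditer(sequence)]

# CTCF peptide 222-231 from the Methods section.
ctcf_peptide = "DVSVYDFEEE"
print(find_motifs(ctcf_peptide, first_residue=222))
```

With `first_residue=222` the single hit starts at residue 226, in line with the Y226 and F228 positions discussed in the text; a real proteome scan would additionally need a sequence database and a filter for nuclear proteins, neither of which is sketched here.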


Discussion 


Our study reveals that CTCF binds to a CES on the SA2-SCC1 subcomplex
of cohesin. The ablation of this interaction results in a near-complete
loss of CTCF-cohesin anchored loops. Thus, CTCF does not simply
present a passive barrier to cohesin-mediated loop extrusion, but
specifically interacts with the CES to stabilize cohesin at these loci and to
prevent loop disruption. Accordingly, impairment of the CTCF-CES
interaction renders cohesin more dynamic (Fig. 2c, d).

SA2 and SCC1, as well as CTCF, are frequently mutated in a number
of tumour types and the mutations cluster in the CES (Extended Data
Fig. 2b). Therefore, the dysregulation of chromatin looping may be
causally related to carcinogenesis.

We envisage two possible scenarios for the formation of CTCF-anchored
chromatin loops. In the first model (Fig. 4e), cohesin initiates
loop enlargement at distal chromatin loci. These cohesin complexes
remain dynamic because the cohesin release factor WAPL directly binds
to cohesin by engaging the CES and PDS5, and promotes the
opening of cohesin rings at the SMC3-SCC1 interface. Alternatively,
loop enlargement commences at CTCF sites. Cohesin then catalyses
DNA looping at these sites because CTCF counteracts DNA release
(Fig. 4f). These models are not necessarily mutually exclusive, as a
cohesin complex that initiates looping in the former mode may well
be converted into the latter upon encountering CTCF. As CTCF directly
competes with WAPL for binding to the CES (Fig. 2a, b), we suggest that
this interaction stabilizes chromatin loops.

We propose a model for how cohesin and CTCF co-associate on DNA
(Fig. 4g). Our model indicates that cohesin engages CTCF only when
approaching the N terminus of CTCF. Specifically, the 34-amino-acid
flexible linker that connects the YXF motif to the first DNA-binding
zinc finger of human CTCF is sufficiently long to allow SA2-SCC1 DNA
binding towards the N, but not the C, terminus of CTCF (Fig. 4g), thus
confirming previous mapping studies. Stabilization of cohesin by
engagement of the CTCF N terminus may explain why TAD boundaries
arise preferentially when CTCF binding sites are convergently
oriented. If an individual cohesin complex anchors itself at the N
terminus of CTCF, and then reels in DNA until it encounters a cohesin
that is likewise reeling from the opposite CTCF site, this would bring
together CTCF sites. Loop formation by the related Saccharomyces
cerevisiae condensin complex appears to involve a DNA anchoring
function of its HEAT-repeat subunit Ycg1, a paralogue of human SA2
and Saccharomyces cerevisiae Scc3. These different complexes may
therefore use a similar anchoring principle to build loops and provide
structure to genomes. As the CES interface is conserved between isoforms
of SA, we anticipate that ligand binding will affect all cohesin
variants in a similar manner. Similarly, this interface is also conserved
through Scc3 in fungi, despite the absence of CTCF in these organisms.
The CES therefore is likely to represent an ancient interaction
hub on cohesin.

The observation that the CTCF-CES interaction controls DNA looping
indicates that this aspect of cohesin function can be regulated by
cohesin ligands that contain an F/YXF motif. A number of other genome
regulatory factors contain F/YXF motifs, including SGO1 (Extended
Data Fig. 7d, e), which protects centromeric chromatid cohesion by
antagonizing WAPL binding to the CES of SA2-SCC1. We therefore
predict that a number of proteins that contain F/YXF motifs engage the
CES and thereby modulate the ability of cohesin to catalyse genome
folding in functionally divergent chromosomal processes.
Methods


No statistical methods were used to predetermine sample size. The 
experiments were not randomized and investigators were not blinded 
to allocation during experiments and outcome assessment. 


Constructs, protein expression and purification 

The human SA2 fragment amino acids 80-1060, cloned into pGEX- 
6P and codon-optimized for expression in Escherichia coli, was 
obtained from H. Yu. The construct encodes an N-terminal GST tag 
and C-terminal SA2 separated by a PreScission protease cleavage site. 
A plasmid encoding SCC1 was obtained from J.-M. Peters. SA2 was
co-expressed with an N-terminally 6xHis-tagged fragment of SCC1
spanning residues 281-420 cloned into the NcoI-NotI sites of a
pACYCDuet-1 vector (Merck Millipore). CTCF constructs were cloned into the
BamHI and NotI sites of pGEX-6P1. Mutagenesis was performed using a
QuikChange Lightning site-directed mutagenesis kit (Agilent). All the
proteins were expressed in E. coli BL21(DE3) by autoinduction. Cells
were grown at 37 °C to an optical density at 600 nm (OD600) of 0.6
and then shifted to 18 °C for 16 h. Cells were collected with a JLA-8.1
rotor (Beckman) and washed once with ice-cold PBS buffer. Pellets
were resuspended in buffer 1 (40 mM TRIS, pH 7.5, 500 mM NaCl and
0.5 mM TCEP), lysed using a microfluidizer (Microfluidics) and
centrifuged at 4 °C for 1 h at 15,000 rpm using JA-20/14 rotors
(Beckman).

The GST- and His-tagged SA2-SCC1 complex was applied to Co2+-conjugated
IMAC sepharose resin (GE Healthcare) using a Minipuls3
peristaltic pump (Gilson), washed with buffer 1 supplemented with
20 mM imidazole and eluted using buffer 1 supplemented with 300 mM
imidazole. The Co2+ eluate was then bound to Glutathione Sepharose
4 Fast Flow resin (GE Healthcare) using a Minipuls3 peristaltic pump
(Gilson), washed with buffer 1 and eluted by adding 10 mM reduced
L-Glutathione (Sigma-Aldrich) into buffer 1. The GST tag was cleaved by
PreScission protease (EMBL core facilities) during overnight incubation
at 4 °C. Cleaved protein was concentrated using an Amicon Ultra-15
concentrator (Millipore), applied to a MonoQ 5/50 GL column (GE
Healthcare) in buffer 2 (40 mM TRIS, pH 7.5, 150 mM NaCl and 0.5 mM
TCEP) and eluted with a linear gradient of buffer 2 containing 1 M NaCl.
It was then further purified using a HiLoad 16/60 Superdex 200 prep-grade
column (GE Healthcare) in buffer 3 (20 mM TRIS, pH 7.7, 300 mM NaCl
and 5 mM TCEP). The final purified proteins were concentrated using
an Amicon Ultra-15 concentrator (Millipore) and flash-frozen in liquid
N2 for storage at -80 °C.


Crystallization and structure determination 

Crystals of SA2(80-1060) in complex with SCC1 amino acids 281-420 
(otherwise denoted the SA2-SCC1 complex) were grown by hanging- 
drop vapour diffusion at 20 °C by mixing equal volumes of protein at 
8 mg/ml and crystallization solution containing 0.06 M Morpheus
Divalents mix, 0.1 M Morpheus buffer system 2, 48% (v/v) Morpheus 
EOD_P8K (Molecular Dimensions). Crystals were soaked for 24-48 h 
with a peptide (obtained from peptid.de) including amino acid resi- 
dues 222-231 of CTCF (Uniprot ID Q8NI51; DVSVYDFEEE). Crystals 
were cryo-protected by adding 15% glycerol to the well solution and 
flash-frozen in liquid nitrogen. 

Diffraction data for all crystals were collected at 100 K at an X-ray
wavelength of 0.966 Å at beamline ID30A-1/MASSIF-1 of the European
Synchrotron Radiation Facility, with a Pilatus3 2M detector, using
automatic protocols for the location and optimal centring of crystals.
The beam diameter was selected automatically to match the crystal
volume of highest homogeneous quality. Data were processed with
XDS and imported into CCP4 format using AIMLESS.

The structure was determined by molecular replacement using
Phaser. A final model was produced by iterative rounds of manual
model-building in Coot and refinement using PHENIX. The
CTCF-containing model was refined to a resolution of 2.7 Å with an
Rwork and an Rfree of 25% and 27%, respectively (Extended Data Table 1b).
Analysis by MolProbity showed that there are no residues in disallowed
regions of the Ramachandran plot and the all-atom clash score was 7.2.
The model shown in Fig. 4g was generated by superposition on DNA
of SA2-SCC1-CTCF (RCSB Protein Data Bank code (PDB) 6QNX) with
DNA-bound Saccharomyces cerevisiae SCC3-SCC1 (PDB 6H8Q) and
a composite model of DNA-bound CTCF zinc fingers assembled from
PDB 5YEF and PDB 5YEL.


GST pulldowns and peptide arrays 

For GST pulldowns, 10 µM GST-tagged CTCF constructs and 2.5 µM
SA2-SCC1 were mixed in 50 µl buffer 4 (20 mM TRIS, pH 7.7, 300 mM
NaCl and 0.5 mM TCEP) + 0.1% Tween-20 containing 25 µl of a 50% slurry
of GST sepharose beads per reaction. For WAPL and SGO1 competition
assays, 2.5 µM GST-tagged CTCF(86-267) was incubated with 1 µM of
SA2-SCC1 and increasing concentrations of WAPL(1-600) or an SGO1
peptide spanning amino acids 331-349 and phosphorylated at T346 (molar
ratios are indicated in each figure), under reaction conditions that were
otherwise identical to GST pulldowns. Reactions were incubated for
1 h at 4 °C. Twenty-five microlitres of the reaction were withdrawn as
the reaction input and the remainder was washed 5 times with 500 µl
of buffer 4 + 0.1% Tween-20. Samples were boiled in 1x SDS sample
loading buffer (NEB) for 5 min to obtain the bound fraction, followed
by SDS-PAGE analysis.

Isothermal titration calorimetry (ITC) was performed using a MicroCal iTC
200 (Malvern Panalytical) at 25 °C. SA2-SCC1 and the CTCF, SGO1 and
WAPL peptide ligands were dialysed overnight at 4 °C against 20 mM
TRIS, pH 7.7, 150 mM NaCl, 0.5 mM TCEP. For each titration, 300 µl of
50 µM SA2-SCC1 was added to the calorimeter cell. Peptide concentrations
were adjusted to 500 µM, and the peptides were injected into the sample
cell as 16 x 2.5-µl syringe fractions. Results were analysed and displayed
using the Origin 7.0 software package supplied with the instrument.
Data were analysed using the one-site binding model.

Peptide arrays, with an area of 3 cm2, were obtained from R. Volkmer
(http://immunologie.charite.de). Arrays were washed with 100% ethanol
for 5 min on a shaker at 21 °C, followed by 3 washes, for a total of
10 min, in TBS-T buffer (50 mM Tris pH 7.5, 150 mM NaCl and 0.05%
Tween-20). For the blocking step, arrays were incubated in 1x blocking
buffer (Sigma B6429) for 3 h at 21 °C, followed by 3 washes in TBS-T
for a total of 10 min. SA2-SCC1 and SA2(F371A)-SCC1 were added to
1x blocking buffer at a final concentration of 1.2 µM and incubated
with the array overnight at 4 °C under gentle agitation. The membrane
was washed 3 times (1x 30 s, and then 2x 5 min) at 21 °C. The anti-6xHis
(anti-polyHis-HRP) antibody (Sigma A-7058) was diluted 1:2,000 in
1x blocking buffer and incubated with the arrays for 1 h at 21 °C. The
array was washed 3 times (1x 30 s, and then 2x 5 min) and developed by
addition of 3,3’-diaminobenzidine (Sigma D4293) for 1 min followed by
quenching in deionized H2O. To measure non-specific binding of the
anti-6xHis antibody, all steps were identical except that no SA2-SCC1
protein solution was added to 1x blocking buffer during the overnight
incubation step. Arrays were imaged with a BioRad Gel Doc XR+
documentation system. Spot intensities were measured using ImageJ 1.52k.
Three independent experiments were done and the apparent dissociation
constants were determined by normalization with ITC data from
CTCF(222-231) (Extended Data Table 1a).


Genome editing and cell culture 

Cells were cultured in Iscove’s modified Dulbecco’s medium sup- 
plemented with 10% FCS (Clontech), 1% penicillin-streptomycin 
(Invitrogen) and 1% UltraGlutamin (Lonza). The guide (g)RNA target- 
ing exon 1 of CTCF was designed and annealed into pX330 (primer, 
5’-CGATTTTGAGGAAGAACAGC-3’). To modify the targeted locus,
we cotransfected a 120-base-pair repair oligonucleotide containing
the desired mutation and a silent mutation (repair oligonucleotide:
5’-CCAAAAAGAGCAAACTGCGTTATACAGAGGAGGGCAAAGATGTAGATGTGTCTGTCGCCGATGCTGAAGAAGAACAGCAGGAGGGTCTGCTATCAGAGGTTAATGCAGAGAAAGTGGTTG-3’).
pBabePuro was cotransfected in a 10:1 ratio to the pX330.
Transfected clones were selected using 2 µg/µl
puromycin for 2 days. Colonies were picked when they were clearly 
visible, gDNA of clones was isolated and mutations were validated by 
Sanger sequencing. 

To target the C terminus of SCC1, a gRNA (primer:
5’-CCAAGGTTCCATATTATATA-3’) was cloned into pX459 V2.0 (Addgene
plasmid no. 62988). The SCC1-Halo tag HR template was a gift from J. Rhodes.
SCC1-Halo cell lines were generated by cotransfection of pX459 and
the SCC1-Halo tag HR vector using FUGENE HD Transfection Reagent.
Cells were selected with puromycin (2 µg/ml) for 2 days. Colonies were
picked when they were clearly visible and validated using western blot 
analysis and immunofluorescence. 


Antibodies 

The following antibodies were used for western blots: SMC1 (A300- 
055A, Bethyl), CTCF (07-729, Millipore and ab128873, Abcam), HSP90 
(F-8, Santa Cruz), SCC1 (05-908, Millipore), tubulin (T5168, Sigma) and 
H4 (05-858, Millipore). All primary antibodies were used at a 1:1,000 
dilution, with the exception of HSP90 and tubulin (1:10,000). Second- 
ary antibodies for western blot analysis were used in a 1:2,000 dilution:
goat anti-rabbit-PO and goat anti-mouse-PO (DAKO). For ChIP-seq, we 
used the following antibodies: SCC1 (ab992, Abcam), CTCF (3418S, Cell 
Signaling) and IgG (15006, Sigma-Aldrich). 


FRAP 

Cells were grown on LabTek II chambered cover glass (Thermo Scientific
Nunc). Two days before imaging, cells were transfected with
DNA helicase B fragment fused with near-infrared fluorescent protein
(DHB-iRFP) using FUGENE HD Transfection Reagent. Before imaging,
cells were incubated with 300 nM fluorescent Halo tag ligand JF585 for
30 min. Cells were washed 3 times with normal medium and incubated
for 1 h to allow excess ligand to exit. Medium was replaced twice more
with prewarmed Leibovitz L-15 medium (Invitrogen). Live-cell imaging
was performed on a Leica SP5 confocal microscope with a 63x 1.2 NA
water objective using the LAS-AF FRAP-Wizard. Before bleaching, five
images were taken. Half of the nucleus of G1 cells was photobleached
using 6 pulses of 100% transmission of a 561-nm laser. Subsequently,
600 frames were taken, one every 2 s. Fluorescence intensity was measured
in the bleached and unbleached areas in user-defined regions using
ImageJ v.1.52q, and adjusted by hand for nucleus movement. Measurements
were corrected for photobleaching by monitoring a nonbleached
cell. Recovery was quantified by calculating the difference in intensity
between the bleached and unbleached regions after background correction.
Nondiffusive SCC1-Halo (Extended Data Fig. 3f) was quantified by the
relative loss in fluorescence intensity in the unbleached region between
the first frame postbleaching and five frames prebleaching.
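
The recovery measure described above (the background-corrected difference between the unbleached and bleached halves of the nucleus) might be computed along the following lines. This is a minimal sketch, not the authors' analysis script: the function names, the single scalar background value and the example intensities are all assumptions made for illustration.

```python
import numpy as np

def frap_difference(bleached, unbleached, background=0.0):
    """Background-corrected intensity difference between the two nuclear halves.

    A value that decays towards zero over time indicates that fluorescence
    in the bleached half is recovering.
    """
    bleached = np.asarray(bleached, dtype=float) - background
    unbleached = np.asarray(unbleached, dtype=float) - background
    return unbleached - bleached

def frame_times(n_frames=600, interval_s=2.0):
    """Acquisition times (in minutes) for the post-bleach image series."""
    return np.arange(n_frames) * interval_s / 60.0

# Hypothetical intensities for three post-bleach frames (arbitrary units).
difference = frap_difference([20.0, 35.0, 50.0], [80.0, 75.0, 70.0],
                             background=10.0)
print(difference)
print(frame_times()[-1])
```

With 600 frames at 2-s intervals the series spans just under 20 min, which matches the observation window quoted for the FRAP experiments.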


Colony-formation assay 

Cells were seeded at equal density and transfected with either no
oligonucleotide or short interfering (si)RNAs targeting luciferase, CTCF or
SMC1A. All siRNAs were ON-TARGETplus SMARTpools manufactured
by Dharmacon. Transfection was repeated after 3 days, and after an 
additional 4 days samples were fixed for 10 min with 96% methanol and 
stained with 0.25% crystal violet. Cells treated by the same protocol 
were collected for western blot analysis; samples were collected two 
days before fixation to have enough cells for western blot analysis. 


Chromatin fractionation 

For the chromatin fractionation experiment shown in Extended Data
Fig. 3e, 50 million cells per cell line were collected and fractionation
was performed using Subcellular Protein Fractionation Kit for Cultured
Cells (78840, Thermo Fisher Scientific) according to the manufacturer’s
protocol, with minor changes. The pellet was washed twice after
centrifugation. Western blots were performed as previously described.


Hi-C 

Samples for Hi-C were prepared as previously described. Raw
sequence data were mapped and processed using HiC-Pro v.2.9 with
hg19 as reference. Statistics on the number of valid pairs and percentage
of cis contacts are summarized in Extended Data Table 3b, c.
Replicates 1 and 2 are highly similar, with a reproducibility >0.98 as assessed
by HiCRep v.1.8.0, and were subsequently combined into one Hi-C
dataset. The valid pair files generated by HiC-Pro were used to create
juicebox-ready files using juicebox-pre (juicer tools v.0.7.5). For
visualization, contact matrices were ICE-normalized and counts were
normalized for 100 million contacts per sample.
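
Scaling counts to a fixed depth of 100 million contacts can be sketched as a single rescaling step. The snippet below is an illustrative assumption rather than the pipeline used here: in particular, it counts each contact once by summing the upper triangle (including the diagonal) of a symmetric matrix, and the function name `scale_to_target` is hypothetical.

```python
import numpy as np

def scale_to_target(matrix, target=100_000_000):
    """Rescale a symmetric contact matrix so its contacts sum to `target`.

    Contacts are counted once each, via the upper triangle including the
    diagonal (an assumption about how the total is defined).
    """
    matrix = np.asarray(matrix, dtype=float)
    total = np.triu(matrix).sum()
    if total == 0:
        raise ValueError("matrix contains no contacts")
    return matrix * (target / total)

# Toy 2-bin matrix with 7 contacts, rescaled to a target of 70.
toy = np.array([[2.0, 1.0],
                [1.0, 4.0]])
scaled = scale_to_target(toy, target=70)
print(scaled)
```

Applying the same target to every sample makes contact matrices from libraries of different sequencing depth directly comparable by eye.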

Loops were then called with HICCUPS v.1.11.09 at 5-kb, 10-kb and
25-kb resolution. To visualize the genome-wide effect of the introduced
CTCF mutations on loops, we performed aggregate peak analysis as
implemented in GENOVA v.0.9.8 (https://github.com/robinweide/
GENOVA), using loops that had previously been defined in wild-type
HAP1 cells. In brief, for a set of loop coordinates a square submatrix is
selected such that it is centred on the corresponding coordinates, with
a 100-kb flanking region upstream and downstream. These submatrices
are then averaged to obtain a mean contact map for these locations.
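
The submatrix-averaging step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the GENOVA implementation: it handles a single chromosome held as a dense matrix, skips loops whose window would run off the matrix, and uses a hypothetical function name and toy data.

```python
import numpy as np

def aggregate_peak_analysis(matrix, loops, resolution=10_000, flank=100_000):
    """Average square submatrices centred on each loop.

    matrix: square contact matrix for one chromosome (bins x bins).
    loops: iterable of (anchor1_bp, anchor2_bp) coordinates.
    Loops whose window would extend beyond the matrix are skipped.
    """
    matrix = np.asarray(matrix, dtype=float)
    n = flank // resolution          # number of flanking bins on each side
    size = matrix.shape[0]
    windows = []
    for anchor1, anchor2 in loops:
        i, j = anchor1 // resolution, anchor2 // resolution
        if min(i, j) - n < 0 or max(i, j) + n >= size:
            continue
        windows.append(matrix[i - n:i + n + 1, j - n:j + n + 1])
    if not windows:
        raise ValueError("no loop window fits inside the matrix")
    return np.mean(windows, axis=0)

# Toy example: a 50-bin matrix with two hypothetical loops.
toy = np.zeros((50, 50))
toy[20, 30] = 4.0
toy[15, 35] = 2.0
apa = aggregate_peak_analysis(toy, [(200_000, 300_000), (150_000, 350_000)],
                              resolution=10_000, flank=50_000)
print(apa.shape, apa[5, 5])
```

The centre pixel of the averaged map reflects the mean loop signal, so a loss of loops in a mutant sample would appear as a weaker centre relative to the surrounding pixels.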

Similar to the aggregate peak analysis, aggregate TAD analysis was
done to visualize how TAD structures are affected by the CTCF mutations.
For this analysis, we used TADs that had previously been defined
for wild-type HAP1 cells. In brief, these TADs were called using HiCseg
on 10-kb matrices as input, with a Poisson distribution, the extended
diagonal model and a maximum number of change points of 50. To compensate
for TADs of different sizes, the selected regions are resized before
averaging the contact maps. These regions comprise the TAD itself
and a flanking region of half its size. We calculated the insulation score
as previously described. The insulation score was computed using the
implementation of GENOVA, with a rolling window size of 25 kb. The
insulation score was then aligned to TAD borders to create heat maps.

For Extended Data Fig. 6, the compartment score was calculated as
previously described. In brief, the compartment score is computed
per chromosome arm by obtaining the first eigenvector of the
observed-over-expected matrix, minus 1. Then, this eigenvector is multiplied by
the square root of its eigenvalue to obtain the compartment score.
To correctly orient the scores so that positive values correspond to
compartment-A regions, we used the correlation of the compartment
score to H3K4me1 peaks in wild-type cells (J.H.I.H. et al., manuscript
in preparation).
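
The compartment-score calculation described above can be sketched as follows. Several choices here are assumptions rather than details given in the text: "minus 1" is read as subtracting 1 from every entry of the observed-over-expected matrix, the "first" eigenvector is taken to be the one with the largest eigenvalue, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def compartment_score(obs_exp, orientation_signal=None):
    """Compartment score for one chromosome arm.

    obs_exp: square observed-over-expected contact matrix.
    orientation_signal: optional per-bin signal (for example H3K4me1 peak
    counts); the score is flipped if it correlates negatively with it.
    """
    m = np.asarray(obs_exp, dtype=float) - 1.0
    m = (m + m.T) / 2.0                      # guard against small asymmetries
    eigenvalues, eigenvectors = np.linalg.eigh(m)
    top = int(np.argmax(eigenvalues))        # assumed meaning of "first"
    score = eigenvectors[:, top] * np.sqrt(max(eigenvalues[top], 0.0))
    if orientation_signal is not None:
        r = np.corrcoef(score, orientation_signal)[0, 1]
        if r < 0:
            score = -score
    return score

# Toy checkerboard: bins 0-1 and bins 2-3 form two compartments.
pattern = np.array([0.5, 0.5, -0.5, -0.5])
toy_obs_exp = 1.0 + np.outer(pattern, pattern)
h3k4me1 = np.array([1.0, 1.0, 0.0, 0.0])     # hypothetical active-mark signal
scores = compartment_score(toy_obs_exp, h3k4me1)
print(scores)
```

Because the sign of an eigenvector is arbitrary, the orientation step is what makes positive scores consistently mark compartment-A regions across samples.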

For Extended Data Fig. 6e, we compared the effect of the CTCF(Y226A/F228A)
mutation on genome organization to that of CTCF and cohesin degradation.
Raw Hi-C data from Gene Expression Omnibus (GEO) accession
GSE102884 were converted to HiC-Pro format and ICE-normalized.
Relative contact probability profiles were generated using GENOVA.


ChIP-seq 

Samples for ChIP-seq were prepared and sequenced as previously
described, with minor changes. The DNA was sheared using a Bioruptor
Pico (Diagenode), with 5 cycles of 15 s on and 90 s off. Reads were first
trimmed using TrimGalore v.0.6.0, then mapped to hg19 using Bowtie2
v.2.3.4 with default settings. Bigwig files were generated with
DeepTools v.3.1.3 with the following settings: minimum mapping
quality of 15, bin length of 10 bp, extending reads to 200 bp and reads
per kilobase per million reads normalization.

Peaks were called for all samples using MACS2 v.2.1.1 with default
options. Overlaps between the sets of identified peaks across samples
were obtained using BEDtools v.2.25.0. Heat maps were generated
using DeepTools for the different sets of peaks identified in the wild-type
cell line, excluding those overlapping blacklisted regions of the
genome.

CTCF sites shown in Hi-C contact matrices were obtained from a previous
publication. In brief, these sites were generated by intersecting
CTCF peaks with CTCF motifs from JASPAR CORE 2014, using FIMO
to annotate their motif orientation.


ChIP-qPCR 

ChIP-qPCR analysis was performed to assess SCC1 and CTCF abundance 
at specific genomic loci. SCC1 ChIP was performed three times, and on 
each ChIP three qPCRs were performed in duplicate. A representative 
qPCR analysis of each ChIP was used for quantification. For CTCF and 
IgG, two ChIPs were performed in duplicate. Reactions were performed 
using SYBR No-Rox Mix 2x (BIOLINE) and run on a LightCycler 480 II 
(Roche). Ct values were determined for input and ChIP samples, and
subsequently the ΔCt value was converted into a percentage of input.
The primers are listed in Extended Data Table 3a. 
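
The conversion of a ΔCt value into a percentage of input follows a widely used calculation, sketched below. The input fraction (1% here), the assumption of perfect amplification efficiency (a factor of 2 per cycle) and the function name are illustrative assumptions; none of these values is stated in the text.

```python
import math

def percent_input(ct_chip, ct_input, input_fraction=0.01):
    """Convert ChIP and input Ct values into a percentage of input.

    input_fraction: fraction of the chromatin that was kept as input
    (assumed value). The input Ct is first adjusted to represent 100% of
    the chromatin, then the ChIP signal is expressed relative to it,
    assuming the product doubles every cycle.
    """
    adjusted_input = ct_input - math.log2(1.0 / input_fraction)
    delta_ct = adjusted_input - ct_chip
    return 100.0 * 2.0 ** delta_ct

# Hypothetical Ct values: the ChIP sample crosses threshold 2 cycles after a 1% input.
print(percent_input(ct_chip=27.0, ct_input=25.0, input_fraction=0.01))
```

A ChIP sample that amplifies two cycles later than a 1% input therefore corresponds to 0.25% of input under these assumptions.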


RNA sequencing 

Samples for RNA sequencing were prepared and sequenced as previously 
described. Reads were aligned to hg19 using TopHat v.2.1.1 and 
later counted with HTSeq v.0.11.1 using Gencode v.19 gene-build as 
reference. Differentially expressed genes were identified with DESeq2 
v.1.18.135, with an adjusted P value threshold of 0.05 and considering 
only protein-coding genes. 
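
The legend of Extended Data Fig. 7 states that P values were adjusted with the Benjamini-Hochberg procedure. As a rough illustration of what that adjustment does, the sketch below implements the textbook procedure in plain Python. It is not DESeq2's code (DESeq2 also applies its own filtering steps), and the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for offset, idx in enumerate(reversed(order)):
        rank = m - offset  # 1-based rank of this P value
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up P values for four genes; keep those with adjusted P < 0.05.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adj) if p < 0.05]
print(adj, significant)
```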


Reporting summary 
Further information on research design is available in the Nature 
Research Reporting Summary linked to this paper. 


Data availability 


Coordinates are available from the PDB under accession number 6QNX 
for the SA2-SCC1-CTCF complex. The generated Hi-C, RNA sequenc- 
ing and ChIP-seq data have been deposited in GEO, accession number 
GSE126637. Any other relevant data are available from the correspond- 
ing authors upon reasonable request. 
Extended Data Fig. 1 | Biochemical analysis of CTCF binding to SA2-SCC1. 
a, Domain architecture of CTCF. CTCF fragments tested for SA2-SCC1 binding 
by GST pulldown analysis are indicated. The region that retains SA2-SCC1 is 
highlighted in magenta. b, Summary data showing results of GST pulldowns. 
The input and the bound fractions were analysed by SDS-PAGE. CTCF 
fragments that bind SA2-SCC1 are shown in magenta. The experiment was 
repeated once. c, ITC curves. The binding stoichiometry (N) and dissociation 
constants (Kd) are indicated. The experiment was repeated three times, with 
consistent results. d, Fo − Fc omit electron-density Fourier map contoured at 3σ. 
e, LIGPLOT representation of the interaction between the CTCF peptide and 
SA2-SCC1. The CTCF peptide is shown in magenta, SA2 in blue and SCC1 in 
Extended Data Fig. 2 | Analysis of the SA2-SCC1-CTCF structure. a, Multiple 
sequence alignment of SA2 (here denoted STAG2) orthologues and paralogues. 
*Key amino-acid residues that engage CTCF. b, Missense mutation frequencies 
plotted onto the SA2 structure. R370 (a hotspot in SA2) is indicated. The inset 
shows an overview of the mutation hotspots R370 of SA2, Y226 and F228 of 
CTCF, and S334, K335, R338 and L341 of SCC1. c, ITC progress curves of binding 
between WAPL(423-463) and SA2-SCC1. d, Competition between SGO1 and 
CTCF for SA2-SCC1 binding. SA2-SCC1 was incubated with GST-CTCF(86-267). 
Increasing amounts (lanes 4-8; molar ratios are indicated) of the SGO1 
peptide phosphorylated at T346 (spanning residues 331-349) were added and 
the input and the bound fraction analysed by SDS-PAGE. The experiment was 
repeated twice. One representative example is shown. e, Domain architecture 
and sequence alignments of cohesin regulators that contain F/YXF motifs. 
Putative CES-interacting residues are highlighted in red. f, Regular expression 
motif used to query the human and yeast proteomes for factors containing 
F/YXF motifs. Regular expression syntax: letters denote a specific amino acid; 
square brackets denote a subset of allowed amino acids; curly brackets denote 
length variability. 
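
The kind of motif query described in panel f can be illustrated with a short pattern search. The sketch below rests on a simplifying assumption: the pattern `[FY].F` (F or Y, any one residue, then F) is only a minimal reading of "F/YXF" and is not the authors' actual query expression, which is not reproduced in this legend and also uses curly brackets for length variability. The function name `find_fyxf` is hypothetical; the test peptides are the CTCF wild-type and Y226A/F228A sequences listed in Extended Data Table 2.

```python
import re

# Minimal reading of "F/YXF": F or Y, any single residue, then F.
# Assumption: the authors' real query expression was more specific.
# The lookahead lets overlapping matches be reported as well.
FYXF = re.compile(r"(?=([FY].F))")

def find_fyxf(sequence, first_residue=1):
    """Return (residue_number, motif) for each F/YXF match in `sequence`.

    first_residue is the residue number of the first letter of `sequence`.
    """
    return [(m.start() + first_residue, m.group(1))
            for m in FYXF.finditer(sequence)]

print(find_fyxf("DVSVYDFEEEQ", first_residue=222))  # [(226, 'YDF')]
print(find_fyxf("DVSVADAEEEQ", first_residue=222))  # []
```

The wild-type peptide yields a single hit starting at Y226, and the double-alanine mutant yields none, in line with the residues named in the legend.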
Extended Data Fig. 3 | Generation of CTCF(Y226A/F228A) cells. a, Schematic of 
CRISPR-Cas9-based generation of CTCF(Y226A/F228A) cells. The guide targets 
cleavage of exon 1 of the CTCF gene. The repair oligonucleotide renders the 
gene noncleavable by Cas9, and simultaneously introduces mutations in the 
codons that encode Y226 and F228. b, The CTCF(Y226A/F228A) mutation was 
confirmed by Sanger sequencing, including a silent mutation at position 229. 
c, Western blot depicting Halo-tagged SCC1 in wild-type and CTCF(Y226A/F228A) 
cells. The parental wild-type cells are included as a control. This experiment 
was performed once. d, Representative images of cells in G1 and G2, as 
indicated by their nuclear and cytoplasmic localization of DHB-iRFP, 
respectively. e, Chromatin-bound levels of CTCF and SMC1 analysed by western 
blot. Histone H4 is used as a control for the chromatin fraction. The 
CTCF(Y226A/F228A) mutation does not evidently affect overall CTCF and cohesin 
levels on chromatin. WCE, whole-cell extract; CB, chromatin-bound fraction. 
This experiment was performed twice with similar results. f, Relative SCC1-Halo 
fluorescence intensity quantified in the unbleached area directly after 
photobleaching, as a proxy for the chromatin-bound fraction of SCC1. This 
nondiffusive fraction is not evidently affected by the CTCF(Y226A/F228A) 
mutation. Individual cells of three independent experiments are plotted as 
dots and their mean is indicated (21 wild-type cells and 17 CTCF(Y226A/F228A) 
cells were scored). 
Extended Data Fig. 4 | TAD analyses and Hi-C replicates. a, Schematic of a 
Hi-C matrix displaying DNA-DNA contacts across a genomic region that 
includes two TADs. TADs in general are flanked by inwards-pointing CTCF sites 
(magenta arrows). Signal close to the diagonal line reflects short-range 
contacts, and contacts that span longer distances are found further away from 
the diagonal. The contacts within a TAD are formed by cohesin complexes (blue 
circles). Cohesin builds loops that it can enlarge until it encounters CTCF. Some 
TADs are enriched for contacts between the two CTCF sites that lie at their 
boundaries. These contacts are referred to as CTCF-anchored loops. 
b, Aggregate TAD analysis depicting the average contact frequency across 
TADs defined in wild-type cells. c, Heat map of the insulation score at TAD 
borders, as defined for wild-type cells. d, Aggregate peak analysis as in Fig. 3c, 
using two independent library preparations per genotype. e, Aggregate TAD 
analysis for wild-type and CTCF(Y226A/F228A) cells as in b. f, Heat map of 
insulation scores at TAD borders for wild-type and CTCF(Y226A/F228A) cells as 
in c. 
Extended Data Fig. 5 | CTCF(Y226A/F228A) mutation has little effect on CTCF 
levels at CTCF sites. a, Hi-C contact matrix of region chromosome 16: 
77000000-78300000 at 10-kb resolution for the wild-type cell line (bottom 
triangles) and the CTCF(Y226A/F228A) cell line (top triangles). CTCF sites are 
depicted below; those selected for qPCR are shown in colour. Red triangles 
indicate sites with a forward motif and blue triangles indicate sites with a 
reverse motif. The numbers underneath indicate the qPCR primer pairs shown in 
b. Primer pair 11 (indicated with *) is at a locus devoid of SCC1 and CTCF. 
b, ChIP-qPCR analysis of SCC1 (cohesin) enrichment at the aforementioned CTCF 
sites and control locus (*) in wild-type and CTCF(Y226A/F228A) cells. The mean 
of three independent ChIP experiments is shown with the s.d. c, ChIP-seq tracks 
for SCC1 and CTCF at region chromosome 16: 77000000-78300000 in wild-type and 
CTCF(Y226A/F228A) cells. The loci used for ChIP-qPCR analysis are indicated 
below the SCC1 ChIP-seq tracks. RPKM, reads per kilobase per million reads. 
d, ChIP-qPCR analysis of CTCF abundance at loci 1-7, as described in Fig. 3d. 
Analysis includes IgG as a control. The mean of two independent ChIP 
experiments is shown. Details of replicates are given in the Methods. 
e, ChIP-qPCR analysis of CTCF abundance at loci 8-12, as described in Extended 
Data Fig. 4a. Analysis includes IgG as a control. The mean of two independent 
ChIP experiments is shown. Details of replicates are given in the extended 
methods. 
Extended Data Fig. 6 | Compartmentalization is largely maintained in cells 
that contain the CTCF(Y226A/F228A) mutation. a, Hi-C contact matrices of the 
q-arm of chromosome 2 at 500-kb resolution. The corresponding compartment 
scores are plotted above. b, Genome-wide comparison of compartment scores 
for wild-type and CTCF(Y226A/F228A) cells. Pearson correlation = 0.97. 
c, Saddle plots representing the interaction between A and B compartments. 
d, A region of chromosome 1 (55500000-59500000) at 10-kb resolution that 
contains no obvious CTCF-anchored loops. e, Relative contact probability 
profiles for wild-type and CTCF(Y226A/F228A) mutant cells (left), compared to 
previously published contact profiles upon degradation of CTCF (middle) or 
SCC1 (right). The contact probability profile is affected only slightly in the 
CTCF(Y226A/F228A) mutants, similar to the effects of CTCF depletion. 
Extended Data Fig. 7 | Identification of CES ligands. a, Plot depicting the 
log2-transformed fold change in gene expression in relation to the mean of the 
normalized counts for each gene. Differentially expressed genes (adjusted 
P value < 0.05, two-tailed Wald test adjusted for multiple testing using the 
Benjamini-Hochberg procedure) are shown in red. Gene names are included 
for the 40 genes with the highest fold change. b, Western blot assessing 
knockdown of CTCF and the cohesin subunit SMC1 upon transfection with a 
control siRNA targeting luciferase (luc) or siRNAs targeting CTCF or SMC1A. 
This experiment was performed twice with similar results. c, Colony-formation 
assay of wild-type and CTCF(Y226A/F228A) cells upon transfection with a control 
siRNA targeting luciferase or siRNAs targeting CTCF or SMC1A. CTCF remains 
essential for viability in CTCF(Y226A/F228A) cells. This experiment was 
performed four times with similar results. d, Peptide array annotation (top 
left), binding of SA2-SCC1 (top right) or SA2(F371A)-SCC1 mutant (bottom left) 
and antibody control (bottom right). Three independent experiments were done, 
with consistent results. One representative example is shown. e, Amino acid 
sequences of the peptides. Predicted lead-anchoring residues are coloured red. 


Extended Data Table 1 | Summary of ITC data, and X-ray data collection and refinement statistics 

Protein      Residues           Kd (µM)       ΔH (kcal/mol)    TΔS (kcal/mol)   ΔG (kcal/mol)   N‡
CTCF#        222-231            1.04 ± 0.20   -11.08 ± 0.70    -2.92 ± 0.82     -8.16 ± 0.09    0.93 ± 0.04
CTCF†        86-267             0.62          -13.16           -4.61            -8.54           0.78
Wapl†        423-463            32.8          -6.66            -0.54            -6.11           0.62
Wapl†        447-462            78.7          -6.81            -1.20            -5.60           1.00
Shugoshin†   331-341            13.5          -10.67           -4.02            -6.64           0.83
Shugoshin†   331-349 (pT346)§   2.32          -20.00           -12.30           -7.69           0.89

SA2-Scc1-CTCF (PDB 6QNX)

Data collection
Space group               P2₁2₁2₁
Cell dimensions
  a, b, c (Å)             79.02, 107.25, 176.49
Resolution (Å)            45.81-2.70
Rsym or Rmerge            6.9 (175)*
I/σI                      12.0 (0.8)*
CC1/2                     0.99 (0.33)*
Completeness (%)          99.6 (99.7)*
Redundancy                4.4 (4.3)*

Refinement
Resolution (Å)            45.81-2.70
Rwork / Rfree             0.25 / 0.27
No. reflections           46759
No. atoms                 16487
  SA2                     15088
  Scc1                    1235
  Ligand (CTCF)           140
B-factors (mean; Å²)
  SA2                     133.4
  Scc1                    111.3
  Ligand (CTCF)           143.6
R.m.s. deviations
  Bond lengths (Å)        0.004
  Bond angles (°)         0.53

a, ITC data summary. b, X-ray data collection and refinement statistics. 
#Three independent experiments were performed. The mean values ± s.d. are shown. 
†Experiment was performed once. 
‡Binding stoichiometry. 
§pT346, phosphothreonine. 
Data derived from one crystal. 
*Values in parentheses are for the highest-resolution shell. 
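
The thermodynamic columns of the ITC summary are linked by two standard relations, ΔG = ΔH − TΔS and ΔG = RT ln(Kd). The sketch below checks the first CTCF row against both. It is an illustration only: the temperature of 25 °C (298.15 K) is an assumption that the table does not state, and the function names are hypothetical.

```python
import math

R_KCAL = 1.987e-3       # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15  # assumption: 25 degrees C; not stated in the table

def delta_g_from_kd(kd_molar, temperature=TEMPERATURE_K):
    """Free energy of binding (kcal/mol) from a dissociation constant in M."""
    return R_KCAL * temperature * math.log(kd_molar)

def delta_g_from_enthalpy(delta_h, t_delta_s):
    """Free energy of binding (kcal/mol) from delta H and T*delta S."""
    return delta_h - t_delta_s

# CTCF 222-231 row: Kd = 1.04 uM, delta H = -11.08, T*delta S = -2.92.
print(round(delta_g_from_kd(1.04e-6), 2))
print(round(delta_g_from_enthalpy(-11.08, -2.92), 2))
```

Under the assumed temperature both routes give a value close to the tabulated ΔG of -8.16 kcal/mol for this row; other rows may deviate slightly if the measurements were made at a different temperature.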
Extended Data Table 2 | Quantification of peptide arrays 

Position  Protein  Species  Mutation      UniProt  Sequence           Kd (µM)
1         CTCF     Human    Wild-type     P49711   222 DVSVYDFEEEQ    1.04 ± 0.16*
2         CTCF     Human    Y226F         P49711   222 DVSVFDFEEEQ    1.60 ± 0.03†
3         CTCF     Human    Y226A         P49711   222 DVSVADFEEEQ    n.b.
4         CTCF     Human    F228A         P49711   222 DVSVYDAEEEQ    n.b.
5         CTCF     Human    Y226A/F228A   P49711   222 DVSVADAEEEQ    n.b.
6         WAPL     Human    Wild-type     Q7Z5K2   70 TGDPFGFDSDD     5.90 ± 0.95†
7         WAPL     Human    Wild-type     Q7Z5K2   425 KLEFFGFEDHE    12.43 ± 7.72†
8         WAPL     Human    Wild-type     Q7Z5K2   450 KIKYFGFDDLS    2.17 ± 0.16†
9         WAPL     Human    Wild-type     Q7Z5K2   557 ARHWNHPDSEE    ND
10        WAPL     Yeast    Wild-type     Q99359   108 KLSAFNFLDGS    10.65 ± 3.83†
11        CDCA5    Human    Wild-type     Q96FF9   162 RRSCFGFEGLL    ND
12        SGO1     Human    Wild-type     Q5FBB7   332 SNDAYNFNLEE    n.b.
13        MCM3     Human    Wild-type     P25205   704 SYDPYDFSDTE    1.02 ± 0.05†
14        POGZ     Human    Wild-type     Q7Z3K3   1398 TESFYGFEEAD   n.b.
15        SYCP1    Human    Wild-type     Q15431   882 RKMAFEFDINS    6.71 ± 3.18†
16        SYCP2    Human    Wild-type     Q9BX26   867 VNDVYNFNLNG    n.b.
17        SYCP3    Human    Wild-type     Q8IZU3   21 FTRAYDFETED     1.96 ± 0.26†
18        Scc1     Yeast    Wild-type     Q12158   278 EPTDFGFDLDI    n.b.
19        Scc2     Human    Wild-type     Q6KC79   1439 TAVFSRYEKHR   ND
20        Scc2     Yeast    Wild-type     Q04002   400 VSLFGSFDQOR    n.b.
21        BTBD7    Human    Wild-type     Q9P203   942 YPDFYDFSNAA    n.b.
22        CENPU    Human    Wild-type     Q71F23   40 PIDVFDFPDNS     4.03 ± 0.22†
23        ZGPAT    Human    Wild-type     Q8N5A5   411 PRNVFDFLNEK    4.03 ± 1.19†
24        ZFHX4    Human    Wild-type     Q86UP3   2144 KDSPYNFSNPP   n.b.
25        ZFHX3    Human    Wild-type     Q15911   2205 KDSPYNFSNPP   n.b.
26        MDC1     Human    Wild-type     Q14676   320 RAQPFGFIDSD    n.b.
27        CHD6     Human    Wild-type     Q8TD26   2177 HRRPYEFEVER   ND
28        TFIIH    Human    Wild-type     Q13888   245 PMDLFDFYEQM    n.b.

Peptide spot signal intensities were correlated to the Kd of CTCF wild type, thus yielding a semiquantitative binding assay. Data points are indicated as mean ± s.d. n.b., no apparent binding. 
ND, not determined owing to nonspecific binding of the anti-6xhistidine antibody. 
*Value for CTCF wild type, based on ITC measurement shown in Extended Data Table 1a. The mean values ± s.d. are shown. 
†Apparent Kd determined on the basis of three independent peptide array experiments. 


Extended Data Table 3 | Primers and Hi-C statistics 

Primer set       Primer orientation   Sequence
Primer pair 1    Forward              GGCACTACAGGACCACGTTT
                 Reverse              CCCAATTGTGTCTGCCTTITT
Primer pair 2    Forward              GTGGTGTGGGGAAGAGTGTT
                 Reverse              GTCAGCTAAACGCCCAGGTA
Primer pair 3    Forward              CAAGTTTTCCACCCGCTTTA
                 Reverse              GAGCCCTAACACCACTCCAC
Primer pair 4    Forward              GGCTTGGAACTGTTGGTCAT
                 Reverse              AGATGGCAGCAGCTTTTCAT
Primer pair 5    Forward              TGATTGTGTACAACAGCTGCAA
                 Reverse              ATTTTTAGGTGCCTCGCAGT
Primer pair 6    Forward              CTGAGCCTCCTGCAAAAGTT
                 Reverse              CTCTTCTTCGCTCCAGCACT
Primer pair 7    Forward              ACTGCAGCCTCAGCTACCTC
                 Reverse              TTTATTGGCATTGCCTCCTC
Primer pair 8    Forward              CAGTCCTTGTGGCTCCTAGC
                 Reverse              TCTGGTGTGCCCTGAACATA
Primer pair 9    Forward              CACCTTGTGGACAGTGGTTG
                 Reverse              AGCCTGTGAAACAGGGTGAG
Primer pair 10   Forward              TACACGGGTGGCTAAAGGAG
                 Reverse              AGCCAGCCAGATGTCAAACT
Primer pair 11   Forward              CATGCCCAGCCAATTATTTT
                 Reverse              CTCTCCTCCACTTCCCCATT
Primer pair 12   Forward              CACTTTTCCGACCCAGAAGA
                 Reverse              GGCCTGGAGAACTCAAACTG

Genotype           Replicate   Total pairs   Valid pairs   Cis         Cis %   Cis < 20 kb   Cis > 20 kb   Cis ratio
Wild type          1           61118122      60166198      47100811    78.28   7085049       40015762      5.65
Wild type          2.1         62631817      61755440      48127333    77.93   7114243       41013090      5.76
Wild type          2.2         190892790     152708260     122528381   80.24   18087008      104441373     5.77
CTCF Y226A F228A   1           63339779      62197640      47164621    75.83   7290092       39874529      5.47
CTCF Y226A F228A   2.1         62326840      61227593      46569997    76.06   7419962       39150035      5.28
CTCF Y226A F228A   2.2         148586127     118816672     93071165    78.33   14814014      78257151      5.28

Genotype           Replicate   Total pairs   Valid pairs   Cis         Cis %   Cis < 20 kb   Cis > 20 kb   Cis ratio
Wild type          1+2.1+2.2   312814428     270823907     217755919   80.40   32286097      185469852     5.74
CTCF Y226A F228A   1+2.1+2.2   272011360     238342505     186805272   78.38   29523837      157281435     5.33

a, Primers. b, Hi-C statistics for replicate library preparations. Libraries 1 and 2 are independent preparations; 2.2 is a deeper resequencing of sample 2.1. The independent libraries 1 and 2.1 were 
used for Extended Data Fig. 4. A merge of replicates 1, 2.1 and 2.2 of the wild-type cells, and a merge of replicates 1, 2.1 and 2.2 of CTCF(Y226A/F228A) mutant cells, was used for Figs. 3, 4a, Extended 
Data Figs. 5a, 6. c, Hi-C statistics after merging replicates of wild-type and CTCF(Y226A/F228A) libraries. 
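
The derived columns of the Hi-C statistics can be recomputed from the raw pair counts. The sketch below assumes, based on the tabulated numbers rather than on any stated definition, that the cis percentage is the number of cis pairs divided by the number of valid pairs and that the cis ratio is the count of cis pairs spanning more than 20 kb divided by the count spanning less than 20 kb. The function name `hic_cis_summary` is hypothetical.

```python
def hic_cis_summary(valid_pairs, cis_short, cis_long):
    """Summarize cis contacts for one Hi-C library.

    valid_pairs -- number of valid read pairs
    cis_short   -- cis pairs separated by less than 20 kb
    cis_long    -- cis pairs separated by more than 20 kb
    The definitions of the two derived values are inferred from the
    table, not taken from a documented pipeline.
    """
    cis_total = cis_short + cis_long
    return {
        "cis_total": cis_total,
        "cis_percent": 100.0 * cis_total / valid_pairs,
        "cis_ratio": cis_long / cis_short,
    }

# Wild type, replicate 1.
summary = hic_cis_summary(60166198, 7085049, 40015762)
print(summary["cis_total"],
      round(summary["cis_percent"], 2),
      round(summary["cis_ratio"], 2))
```

For this library the recomputed values agree with the tabulated cis count, percentage and ratio.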


Corresponding author(s): D. Panne, B. Rowland, K. Muir, E. De Wit 


Last updated by author(s): Nov 1, 2019 


Reporting Summary 


Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency 
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist. 


Statistics 




For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. 


n/a | Confirmed 


The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement 


A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly 


The statistical test(s) used AND whether they are one- or two-sided 
Only common tests should be described solely by name; describe more complex techniques in the Methods section. 


A description of all covariates tested 


A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons 


A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) 
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) 


For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted 
Give P values as exact values whenever suitable. 


For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings 


For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes 


Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated 


Our web collection on statistics for biologists contains articles on many of the points above. 


Software and code 


Policy information about availability of computer code 


Data collection XCuBE2, XDS (20180808), UNICORN 6, LAS-AF FRAP-Wizard 2.7.4.10100 


Data analysis Molecular replacement was done with Phaser (Phenix 1.14-3260). 
Hi-C sequencing data was processed with HiC-Pro 2.9. Replicates similarity was assessed with HiCRep 1.8.0. Hi-C data analysis was 
performed with GENOVA 0.9.98 (github.com/deWitLab/GENOVA). HiC-Pro output was converted to juicer files using juicebox-pre (juicer 
tools 0.7.5). We performed loop calling with HICCUPs 1.11 on juicer files. 
ChIPseq reads were trimmed with TrimGalore 0.6.0. Mapping of ChIPseq data was performed with bowtie 2.3.4.130 to hg19. We 
performed peak calling with MACS2 2.1.131. Overlaps between sets of identified peaks across samples were obtained using BEDtools 
2.25.0. ChIPseq alignment plots were created with deeptools 3.1.3. 
RNAseq data was mapped with TopHat 2.1.133 and count-tables were generated with HTSeq 0.11.1 with the stranded=reverse setting 
using the Gencode v19 gene-build. Differential expression analysis was performed with DESeq2 1.18.135. 
Fluorescence intensity in FRAP experiments was measured using ImageJ 1.52q. 
Structure refinement was done with Phenix (1.14-3260), Structure building was done with COOT 0.8.0-3, Structure renderings were done 
with Pymol (2.2.3), Structure analysis was done with MolProbity (4.3), Gel band quantification was done with imageJ (1.8.0_112), ITC 
data were analyzed with Origin 7.0 


For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. 
We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information. 


Data 


Policy information about availability of data 
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: 


- Accession codes, unique identifiers, or web links for publicly available datasets 
- A list of figures that have associated raw data 
- A description of any restrictions on data availability 


Coordinates are available from the Protein Data Bank under accession number 6QNX for the SA2-SCC1-CTCF complex. The generated Hi-C, RNA-Seq and ChIP-Seq 
data has been deposited in GEO (accession number GSE126637). The current status of these entries is HPUB (‘Hold for Publication’) which indicates that they are 
released when the article is published. We will instruct the data depositories to make these entries publicly accessible prior to publication. 


Field-specific reporting 


Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. 


[x] Life sciences    [ ] Behavioural & social sciences    [ ] Ecological, evolutionary & environmental sciences 


For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf 


Life sciences study design 


All studies must disclose on these points even when the disclosure is negative. 


Sample size No statistical methods were used to predetermine sample size. We have added comments on the sample sizes in the legends. For Figures 
showing GST-pulldown analyses (Fig.1, Extended Data Fig. 1), appropriate controls are used to compare binding side-by-side, as is customary. 
Single replicates are sufficient in this case. For assays measuring dissociation constants (Extended Data Fig.1, Extended Data Table 1 and 3), 
three independent experiments were performed as required for determination of measurement errors. Hi-C library preps were performed in 
duplicate. These replicates showed high similarity and were subsequently merged. RNAseq libraries were 
generated in triplicate. SCC1 ChIP experiments were performed in triplicate, and CTCF ChIPs in duplicate. These were all analysed by 
ChIPqPCRs, and a representative of each ChIP was analysed by ChIP-Seq. FRAP was performed on 21 wild type cells, and on 17 CTCF Y226A 
F228A cells. 


Data exclusions No data was excluded in our analyses. 


Replication We have indicated the number of repeat measurements made and consistency of the results obtained in the figure legends. RNAseq 
experiments were performed in triplicate. Hi-C was performed in duplicate and data was later pooled together. SCC1 ChIPs were performed in 
triplicate, and CTCF ChIPs in duplicate. FRAP was performed in three independent experiments. All attempts at replication were successful. 


Randomization Randomization is not relevant to this study, as protein and crystal samples are not required to be allocated into experimental groups. No 
animals or human research participants are involved in this study. 


Blinding Blinding is not relevant to this study, as protein and crystal samples are not required to be allocated into experimental groups in protein 
structural studies. No animals or human research participants are involved in this study. 


Reporting for specific materials, systems and methods 


We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, 
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. 


Materials & experimental systems Methods 

n/a | Involved in the study n/a | Involved in the study 
Antibodies ChIP-seq 
Eukaryotic cell lines Flow cytometry 
Palaeontology MRI-based neuroimaging 


Animals and other organisms 


Human research participants 


Clinical data 


Antibodies 


Antibodies used His-HRP antibody (Sigma A-7058 lot number O88M4865V) was used for analysis of peptide arrays at a dilution of 1:2000. 
The following primary antibodies were used for ChIP: 

SCC1: Abcam ab992 lot.GR3253930-2 and lot.GR3253930-3 5 ug per ChIP 

CTCF: Cell Signaling 3418S 5 ug per ChIP 

IgG: Sigma-Aldrich 15006 5 ug per ChIP 

For Western blot, the following antibodies were used: 

SMC1: Bethyl A300-055A 6 dilution 1:1000 

SCC1: Millipore 05-908 lot.3055582 dilution 1:1000 

CTCF: Millipore 07-729 lot.3059608 and Abcam ab128873 dilution 1:1000 

HSP90: Santa Cruz F-8 #10518 dilution 1:10.000 

H4: Millipore 05-858 dilution 1:1000 

Tubulin: Sigma T5168 lot.047M4760V dilution 1:10.000 

Goat anti-Rabbit: DAKO P0447 lot.20046248 dilution 1:2000 

Goat anti-mouse: DAKO P0448 lot.20053537 dilution 1:2000 
Validation https://www.abcam.com/rad21-antibody-chip-grade-ab992.html 
https://www.cellsignal.com/products/primary-antibodies/ctcf-d31h2-xp-rabbit-mab/3418 
https://www.sigmaaldrich.com/catalog/product/sigma/i5006 
https://www.bethyl.com/product/A300-055A/SMC1+Antibody 
https://www.emdmillipore.com/INTL/en/product/Anti-RAD21-Antibody,MM_NF-05-908 
http://www.merckmillipore.com/INTL/en/product/Anti-CTCF-Antibody,MM_NF-07-729 
https://www.abcam.com/ctcf-antibody-epr7314b-ab128873.html 
https://www.scbt.com/scbt/product/rapgef6-antibody-f-8 
https://www.merckmillipore.com/INTL/en/product/Anti-Histone-H4-Antibody-pan-clone-62-141-13-rabbit-monoclonal,MM_NF-05-858 
https://www.sigmaaldrich.com/catalog/product/sigma/t5168 
https://www.agilent.com/store/en_US/Prod-P044701-2/P044701-2 
https://www.agilent.com/store/en_US/Prod-P044801-2/P044801-2 


Eukaryotic cell lines 


Policy information about cell lines 


Cell line source(s) HAP1 wild type cells from Carette et al., Nature 2011 a gift from the authors. 
HAP1 CTCF Y226A F228A generated in this study in HAP1 wild type background cells using CRISPR/Cas gene editing. 


Authentication Karyotyping. Mutants were confirmed by Sanger sequencing. 
Mycoplasma contamination All cell lines were negative for mycoplasma contamination. 


Commonly misidentified lines No commonly misidentified line was used. 
(See ICLAC register) 


ChIP-seq 


Data deposition 


Confirm that both raw and final processed data have been deposited in a public database such as GEO. 


Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks. 


Data access links Go to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126637 
May remain private before publication. Enter token qzshmsyavvcndmj into the box 
Files in database submission GSM4052950_SCC1_ChIPseq_WT_L001.fastq.gz 
GSM4052950_SCC1_ChIPseq_WT_L002.fastq.gz 
GSM4052950_SCC1_ChIPseq_WT_peaks.narrowPeak.gz 
GSM4052951_SCC1_ChIPseq_CTCF.Y226A.F228A_L001.fastq.gz 
GSM4052951_SCC1_ChIPseq_CTCF.Y226A.F228A_L002.fastq.gz 
GSM4052951_SCC1_ChIPseq_CTCF.Y226A.F228A_peaks.narrowPeak.gz 
GSM4052952_CTCF_ChIPseq_WT_L001.fastq.gz 
GSM4052952_CTCF_ChIPseq_WT_peaks.narrowPeak.gz 
GSM4052953_CTCF_ChIPseq_CTCF.Y226A.F228A_L001.fastq.gz 
GSM4052953_CTCF_ChIPseq_CTCF.Y226A.F228A_peaks.narrowPeak.gz 


Genome browser session https://genome.ucsc.edu/s/asedeno/CTCF_Y226A_F228A_HAP1 
(e.g. UCSC) 
Methodology 
Replicates SCC1 ChIP experiments were performed in triplicate, and CTCF ChIPs in duplicate. These were all analysed by ChIP-qPCRs, 
and a representative of each ChIP was analysed by ChIP-Seq 

Sequencing depth 
sample                                     total_reads   uniquely_mapped   length   type
5512_11_SCC1_WT_CCGTCC_S25_R1_001          30249392      29761787          65       single
5512_12_SCC1_CTCFmut_GTGAAA_S26_R1_001     30623586      30175451          65       single
5588_5_CTCF_WT_2_GCCAAT_S146_R1_001        54584946      53634127          65       single
5588_6_CTCF_CTCF103_2_CAGATC_S147_R1_001   54586898      53657421          65       single

Antibodies SCC1: Abcam ab992 lot.GR3253930-2 and lot.GR3253930-3 
CTCF: Cell Signaling 3418S 

Peak calling parameters We performed peak calling with MACS2 2.1.131 for SMC1 and CTCF with standard settings. 

Data quality 
sample                                                      >-log10(0.05)   >5FC
5512_11_SCC1_WT_CCGTCC_S25_R1_001_peaks.narrowPeak          52350           29930
5512_12_SCC1_CTCFmut_GTGAAA_S26_R1_001_peaks.narrowPeak     47710           19299
5588_5_CTCF_WT_2_GCCAAT_S146_R1_001_peaks.narrowPeak        71900           57656
5588_6_CTCF_CTCF103_2_CAGATC_S147_R1_001_peaks.narrowPeak   84009           67305

Software We performed peak calling with MACS2 2.1.131 for SMC1 and CTCF with standard settings. 
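
Peak counts of the kind reported under Data quality can be obtained by filtering a MACS2 narrowPeak file. The sketch below is an assumption-based illustration, not the authors' procedure: the form does not say whether the -log10(0.05) threshold was applied to the P value or the q value column, or whether the >5FC count is a subset of the significant peaks. Here the q value is used by default and the fold-change filter is applied to significant peaks only. The function name and the example lines are hypothetical; the column layout (7th column fold enrichment, 8th -log10 P value, 9th -log10 q value) follows the narrowPeak files written by MACS2.

```python
import math

def count_peaks(narrowpeak_lines, alpha=0.05, min_fold=5.0, use_qvalue=True):
    """Count significant peaks and, among them, peaks above a fold cutoff."""
    threshold = -math.log10(alpha)
    score_col = 8 if use_qvalue else 7  # 0-based column indices
    significant = 0
    strong = 0
    for line in narrowpeak_lines:
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[score_col]) > threshold:
            significant += 1
            if float(fields[6]) > min_fold:  # fold enrichment
                strong += 1
    return significant, strong

# Three made-up peaks in narrowPeak format.
example = [
    "chr1\t100\t200\tpeak1\t500\t.\t8.2\t12.0\t9.5\t50",
    "chr1\t300\t400\tpeak2\t30\t.\t3.1\t4.0\t2.0\t40",
    "chr2\t500\t600\tpeak3\t5\t.\t6.0\t1.5\t0.8\t20",
]
print(count_peaks(example))  # (2, 1)
```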




ADAPTED FROM GETTY 


Advice, technology and tools 


Work 


Send your careers story to: naturecareerseditor@nature.com 


AFTER THE FALL: WHAT TO DO 
WHEN YOUR GRANT IS REJECTED 


Failed funding applications are inevitable, but perseverance 
can pay dividends. By James Mitchell Crow 


The day after she submitted a grant proposal last November, Sarah 
McNaughton listed all the tactics she could think of to boost her chances 
of success next time. “I expect to be rejected,” says McNaughton. “It is 
the exception to get funded, not the rule.” Publishing key papers and 
forging new collaborations were on her list, as was collecting preliminary 
data. 

McNaughton, a nutrition researcher at Deakin University in Melbourne, 
Australia, studies dietary patterns to find ways to improve public health. 
For the next phase of her work, she wants volunteers to use wearable 
cameras to capture what influences their food choices in real life, so she 
can determine how those vary depending on a person’s nutrition knowledge 
and cooking skills. 

After McNaughton had sent off her grant 
application to Australia’s National Health and 
Medical Research Council (NHMRC), top of 
her to-do list was launching a pilot study. “If 
we can show that people will wear the cam- 
eras, and they capture the data we need, that 
would really strengthen the application,” she 
says. 

A good idea is no guarantee of grant success. At the US National Science 
Foundation (NSF) in 2017, the most recent year for which data are 
available, proposals worth a total of almost US$4 billion were rejected 
simply because they were beyond the organization’s budget, even though 
reviewers had rated them as very good or excellent. At the US National 
Institutes of Health, the aggregate success rate for research grants was 
20.5% in 2017 (the most recent data available). At the biomedical-research 
funder Wellcome in London, roughly 50% of applications make it through the 
preliminary stage. Of those, around 20% were funded in 2017-18. And the 
NHMRC Investigator Grant category that McNaughton applied for had a 
success rate of just 7% in the previous round in 2019. 
“Given the low success rates of funding 
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More on rejection 
recovery 


It's painful when your grant application 
is rejected, but here are some further 
thoughts on helping you to work 
productively after you've recovered 
from your disappointment. 


- You're not alone. Average success rates 
are around 20% among large funders, 

so grant rejection is common. “Don’t 
lose heart,” says Shahid Jameel, chief 
executive of IndiaAlliance, a biomedical- 
research funder in New Delhi and 
Hyderabad. Rejection doesn’t mean that 
your work is flawed. 


* Give yourself time. Allow a week or so 
to recover, says Candace Hassall, head 
of researcher affairs at the biomedical 
funder Wellcome in London. “When 
people are turned down, they are angry 
and upset. Let that play out,” she says. 
Put the application to one side for a few 
days before you consider your next steps. 


* Share your setback. Discussing the grant
rejection with colleagues, mentors and
others can provide emotional support in
the short term, and give you constructive
feedback to help you to reapply for the
grant when you are ready. “People whose
grants have been rejected might not
want to tell anybody, but getting advice
and input can really help,” says Karen
Noble, head of research careers at Cancer
Research UK, which funds scientists and
health-care professionals working on
cancer treatments.


* Look for ways to improve. Tackling the
concerns of the reviewers who rejected
your grant is important. “But don’t assume
that just by addressing the issues outlined,
you will necessarily be successful next
time,” says Noble. It is unlikely that the same
reviewers will see your application again,
so look at it holistically and strengthen
it for the next round. This might involve
incorporating key new data, learning
a crucial technique or forming a fresh
collaboration.


* Get feedback. Your revision needs 
review by a broad, diverse group of 
people, including colleagues, mentors 
and members of your network. You should 
also circulate the revision to scientists 
who don’t specialize in your field. 
around the world, the odds are stacked against
you in winning that one proposal,” says Drew 
Evans, an energy researcher at the University 
of South Australia in Adelaide, and former 
deputy chair of the Australian Early- and 
Mid-Career Researcher Forum. “Work towards 
a portfolio of activities,” he says. Aiming for 
different strands of funding to cover various 
aspects of a researcher’s work is a safer bet 
than seeking one major grant, he adds. 

McNaughton applies the same strategy to 
any research for which she is seeking fund- 
ing. “I think about how I can split it up and 
target it to other organizations,” she says. It’s 
the first step towards applying to different 
funders without having to start from scratch 
each time — and you can work on it while 
waiting for the outcome of one application. 
“Rather than writing eight different grants, 
you are building an area — calling on the same
literature and on your same publications,” 
McNaughton says. 

Planning for rejection is a crucial part of the 
granting process, say those who have been 
through the wringer (see ‘More on rejection 
recovery’). The limited pot of research funds 
worldwide means that competition is fierce. 
“We receive many more proposals — many 
more very good proposals — than we can pos- 
sibly fund,” says Dawn Tilbury, a mechanical 
engineer at the University of Michigan in Ann 
Arbor who is head of the NSF Engineering 
Directorate, which funds basic research in 
science and engineering. 


Rejection hurts 


Rejection can be a bruising experience, say 
veteran grant-writers, and applicants need 
to give themselves at least a week to get 
through the initial pain. “Take a deep breath, 
close your computer, go home. Talk to your 
partner, or pet your cat,” says Tilbury. It’s a 
rollercoaster that Evans has ridden plenty of 
times. “You go through the various stages of 
emotions — anger, disappointment, despair, 
grieving almost,” he says. “Having time to 
digest, to get upset and angry — you need to 
go through that process, because you need a 
clear mind to come back to it constructively.” 

But grant-seekers can develop tricks to 
handle rejection better, says McNaughton. 
“Part of the reason I make a to-do list is to 
pull back my expectations,” she says. “Once it 
might have taken me a week or two to bounce
back. Now, it’s 24 hours.” 

During the emotional recalibration pro- 
cess, researchers should share the setback 
with others, including colleagues and other 
professional contacts, says Evans. “It is your 
network that is going to give you the support 
and encouragement to keep going,” he says. 
Peers and mentors can help to put the rejec- 
tion into context. They might also know of 
other funding opportunities that can help 
to bridge an immediate financial shortfall, or
of potential collaborators who might be able
to bring a researcher into a larger funding 
opportunity. 


Ask the funder 


After working through the emotional 
component, applicants should next seek 
feedback from the granting organization. The 
level of feedback sent out with rejection letters 
varies drastically, depending on the organiza- 
tion, the scheme applied for and the stage the 
application reached before rejection. 

For smaller funders, feedback might not be 
provided as a matter of course. “That takes a 
bit of effort to put together,” says Kristina 
Elvidge, research manager at the Sanfilippo 
Children’s Foundation in Australia. The charity, 
based near Sydney, funds up to Aus$700,000 
(around US$472,000) annually on research 
into treatments for the rare genetic disorder 
Sanfilippo syndrome, which causes fatal brain 
damage. 

“I always give feedback to rejected applicants
if they ask — but they very rarely do,”
Elvidge says. For researchers whose work 
might align closely with the mission of a small
foundation, seeking feedback can be the first 
step in starting a dialogue and building a rela- 
tionship with a potential long-term funder. 
Megan Donnell, the foundation’s executive 
director and founder, says that the organiza- 
tion welcomes such efforts. 

For applicants to a larger organization or 
agency, such as the NSF, a rejection typically 
comes with some feedback — but that doesn’t 
mean the researcher can’t seek more, Tilbury 
says. “The programme director might be able 
to fill in some of the blanks,” she says. The feedback
can contain many comments, criticisms
and suggestions, and often the grant reviewers
do not agree with each other. The programme
director can help the applicant to peel away
superficial concerns and make sure that she
or he understands the proposal’s underlying
weaknesses so as to address them in a potential
revision, Tilbury says. “It’s one of the things 
programme directors enjoy doing — mentor- 
ing junior faculty members and trying to help 
them in their grant writing.” 

Some funders will not have the resources to 
provide feedback. But researchers should not 
fear tainting their reputation if they ask, says 
Candace Hassall, head of researcher affairs at 
Wellcome. “A funding agency won't think badly 
of anyone contacting them for advice, even if 
we can’t give it.” 


Get feedback on the feedback 


Once a researcher has gathered constructive 
criticism, he or she should candidly appraise 
the strengths and weaknesses of their applica- 
tion. It is important to avoid taking feedback 
personally, says Shahid Jameel, chief execu- 
tive of IndiaAlliance, a large research funder 
in New Delhi and Hyderabad. It supports 
Discussing grant rejections with peers can help to put them into context, advises Drew Evans
(left), shown talking to early-career researcher Nasim Amiralian. 


biomedical and health research in India, and 
is itself funded by Wellcome and the Indian 
government’s Department of Biotechnology. 
“You have to get out of this mindset that there 
is either something wrong with you, or that 
people are against you,” Jameel says. “Review- 
ers really want you to do well — that is why they 
are spending their time reviewing your grant 
and providing feedback.” 

Reviewer feedback often seems less 
negative over time, McNaughton says. “I often 
colour code my reviewers’ comments — green 
for good and red for bad — and then realize 
that actually, there are a lot of good things in
there as well,” she says. “These little things can 
make the process a bit easier.” And getting 
reviewer feedback is certainly preferable to 
not getting any, she adds. For her most recent 
rejection, she received only numerical scores 
on various components of her grant. “Then it 
is very difficult to know how to improve the 
application,” she says. 

Unsuccessful applicants should also seek 
input from colleagues and others whose opin- 
ions they value. “Talk to your peer group and 
your mentors — they will have been through 
the process and they can help you interpret 
the letter,” says Karen Noble, head of research
careers at Cancer Research UK in London, 
which funds work on cancer treatments. 
Researchers can ask colleagues whether they 
agree with the feedback, whether they think 
that the reviewers missed an important point 
because it was not fully explained in the pro- 
posal, or whether they consider the proposal’s 
argument to be flawed. 

Researchers also need to determine whether 
they should reapply to the same funding 
scheme or seek alternatives. If an application 
fell at the first round of screening — in which
reviewers assess the overall suitability of an
applicant and proposal for that particular 
scheme — an alternative funder could be a 
better fit. For example, some government-sup- 
ported agencies, such as the NSF, give grants 
for only basic research, whereas others, such 
as the US Department of Energy, are mis- 
sion-focused and fund more-applied projects. 
“It is also important to consider funders that 
are not in one’s own nation,” says Jameel. 


Grant-writers should keep industrial 
funders in mind, Evans says. He notes that 
applicants might be able to reshape a proposal 
to show its value to a particular business, adding
that scientists who engage with businesses
can diversify their grant portfolio and boost 
the resilience of their research income stream. 
Exploring potential applications of one’s work 
to industry could keep a researcher going until
the next round of funding agency grants. 
“Industry partnerships are now one of the 
hot topics around the water cooler,” he says. 


Nailing the details 

Rejection also lurks after the preliminary 
screening stage when a grant application 
enters peer review. “If there’s a particular 
approach the reviewers don’t like, sometimes 
you may just need to explain it better — but 
sometimes there’s a mismatch,” Tilbury says.
She adds that many early-career scientists seek 
to apply a technique or expertise they honed 
during a postdoc to a new area of research. 
If the reviewers weren't sold on the idea, the
grant-writer needs to think carefully about 
the proposal, Tilbury says. “Are the reviewers 
right? Am I using the wrong hammer to pound
this nail?” 

If a grant-seeker is certain that their pro- 
posal — and their expertise — do fit the grant 
scheme, they need to make that clear to 
reviewers. “A common reason for rejection 
is that the applicant has made assumptions 
about what the reviewers know about them,” 
Hassall says. “If a technique or method is critical
to what you are proposing, you have to
include it. Make it easy for people to get the 
information that they need.” 

Similarly, if referees rejected a grant because
the applicant had no experience in a particu- 
lar technique, it pays to get it and include that 
information in the next round. Before reap- 
plying, researchers should seek collaborators 
who are experts in that area or technique, or 
spend a week working in the collaborator’s lab 
to gain experience. 

It is the applications that just miss out on 
funding that can be the hardest to revise, 
Noble says. “Sometimes there wasn’t anything 
inherently wrong with somebody’s applica- 
tion. It just didn’t make it to the top of the list. 
Those can be the harder ones to try to repack- 
age and put in again.” 

Yet perseverance is key, says Mariane 
Krause, a psychologist at the Pontifical 
Catholic University of Chile, and president of 
the National Commission for Scientific and 
Technological Research (CONICYT) in Chile, 
which funds research in the country. She 
encourages researchers to refine their appli- 
cations and continue to apply. “I have many 
young researchers who get a grant the third 
time,” she says. 

Reapplying to the same organization for 
funding can work if the funder allows it. “The 
success rate of reapplications is significantly 
higher than for first-time applications,” says 
Alex Martin Hobdey, head of the unit at the 
European Research Council (ERC) that coordi- 
nates project calls and follow-ups. For exam- 
ple, new applicants to ERC grants have a 9-10%
success rate. “For people reapplying, the suc- 
cess rate goes up to 14 to 15%. We have people 
who got their first grant on their seventh appli- 
cation,” he adds (see go.nature.com/2vrfugk). 

Some schemes impose a specific hiatus 
period before accepting reapplications, or 
have an annual or biannual application dead- 
line. Others, including Cancer Research UK, 
don’t impose specific limits. But programme 
officers recommend resisting the temptation 
to rush in a revised application as quickly as
possible. “Take time — don’t knee-jerk,” Noble 
says. “Will you really be in a better position to
reapply in a month?”


James Mitchell Crow is a freelance writer in 
Melbourne, Australia. 
As a marine ecologist at the University
of Tasmania in Hobart, Australia, I 
make about 150 dives a year, looking 
for threatened marine species. 

I focus on animals and plants that 
go largely unnoticed: small crustaceans and 
fish species such as gobies and blennies that 
grow 3 or 4 centimetres long. I’m trying to
illuminate the complex interactions between 
marine species and to understand how 
human activities disrupt that process. 

This photo was taken in February 2018 
at Elizabeth and Middleton Reefs in the 
southern Coral Sea Islands. I was counting 
species of fishes and invertebrates, and 
assessing changes since my last visit in 2013. 
The silver schooling fish behind me are 
Pseudocaranx georgianus, but I've also seen 
whale sharks, humpback whales, dolphins 
and sea snakes. I love the sense of peace 
underwater: all the stresses and problems 
of the terrestrial world drop away. 

The Reef Life Survey Foundation in 
Hobart, of which I am president, uses trained
volunteer scuba divers to do underwater 
surveys of biodiversity on rocky and coral 
reefs worldwide. The project will provide an
irreplaceable record of underwater life on 
Earth as it is now. We have data from more
than 3,000 sites across 53 nations. The 
work is extremely important because many 
changes are happening underwater, out
of sight. The Great Barrier Reef and Coral
Sea reefs are threatened by coral bleaching 
associated with climate change, cyclones, 
outbreaks of crown-of-thorns seastars 
(Acanthaster planci) and fishing. 

This is the best of jobs in terms of seeing 
fantastic places that no one has been to 
before, and knowing that you're recording 
information needed for dealing with threats 
to marine biodiversity. But it’s depressing to 
return to places that were once so beautiful 
to find that climate change, or sediment or 
pollution, has largely destroyed them. You 
can’t help but reflect on the damage we're 
doing, and on what a poor state we’re leaving 
Earth in for future generations. 


Graham Edgar is a senior marine ecologist at 
the University of Tasmania in Hobart, Australia. 
Interview by Josie Glausiusz. 


